I.    PREFACE

The  need  for  speed  accompanied  by  reliability  has  driven  many  advances  in  machine 
design.  The  history  of  computing  is  replete  with  examples — many  from  scientific 
fields — where  necessity  became  the  impetus  for  faster,  more  reliable  machinery. 
Without  exception,  history  and  past  designs  have  played  key  roles  in  the  invention 
of  new  equipment.  The  maturity  of  mechanical  calculator  design  was  foundational 
in  the  construction  of  electronic  computers.  Today's  multiprocessor  computers  are 
extensions of uniprocessor machines and include technology developed by the
telephone industry. Many well-worn tools and lessons from the past can be applied.
Many  new  ideas  must  be  put  to  the  test.  This  thesis  is  about  applying  old  principles 
and  evaluating  new  tools  and  equipment. 

A.   A  SURVEY  OF  COMPUTING  MACHINERY 

Nothing  is  more  important  than  to  see  the  sources  of  invention,  which  are, 
in  my  opinion,  more  interesting  than  the  inventions  themselves. 

—  GOTTFRIED  WILHELM  LEIBNIZ  (1646-1716) 

1.    Beginnings 

The history of mathematics and computing is as old as civilization. Tools
like the abacus have long been used to simplify arithmetic problems. Wilhelm Schickhard
(1592-1635), Blaise Pascal (1623-1662), and Gottfried Wilhelm Leibniz designed and
built mechanical, gear-driven calculators. The latest of these was essentially a
four-function calculator. By the mid-1800s, Charles Babbage had designed his Difference
Engine and proceeded to the more advanced Analytical Engine. These machines were
never completed (at least not to the grand scale that Babbage planned), but the basic
design of the Analytical Engine lies at the heart of any modern computer. Consider
his motivation.

The following example was frequently cited by Charles Babbage (1792-1871)
to justify the construction of his first computing machine, the Difference Engine
[Ref. 1]. In 1794 a project was begun by the French government under the direction
of Baron Gaspard de Prony (1755-1839) to compute entirely by hand an enormous
set of mathematical tables. Among the tables constructed were the logarithms of
the natural numbers from 1 to 200,000 calculated to 19 decimal places. Comparable
tables were constructed for the natural sines and tangents, their logarithms, and the
logarithms of the ratios of the sines and tangents to their arcs. The entire project
took about 2 years to complete and employed from 70 to 100 people. The
mathematical abilities of most of the people involved were limited to addition and subtraction.
A small group of skilled mathematicians provided them with their instructions. To
minimize errors, each number was calculated twice by two independent human
calculators and the results were compared. The final set of tables occupied 17 large
folio volumes (which were never published, however). The table of logarithms of the
natural numbers alone was estimated to contain about 8 million digits.

This quote, from Hayes [Ref. 2: p. 1], helps to explain why computers
exist and shows some of the incentive for making them better. Computing
machinery is designed for speed and reliability. A computer's "performance" should
be measured against both of these components. Speed normally receives the most
attention. Reliability, by whatever label you choose to give it, rarely receives due
(and/or timely) attention. Too often, errors and issues of correctness receive careful
consideration only in reactive, not proactive, situations. Kahan says, "The Fast drives
out the Slow even if the Fast is wrong" [Ref. 3: p. 596].

The correctness side of performance is a much tougher game, and reliability
can be a fairly subjective matter. Often we pursue solutions that are "good enough"
(and this cannot always be defined). Time, on the other hand, has well-defined units,
and the standards for measuring time enjoy a history as old as the first sunrise. The
ease with which the programmer can access the machine's clock makes measurements
of this side of performance somewhat easier.
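
As a minimal sketch of how readily the clock can be consulted, the following Python fragment times a small workload. The workload function and the choice of keeping the best of several runs are illustrative assumptions, not a method taken from this thesis.

```python
import time

def time_call(func, *args, repeats=5):
    """Return the best wall-clock time (in seconds) over several runs of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()      # high-resolution clock for interval timing
        func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best

def sum_of_squares(n):
    """A stand-in workload (hypothetical, chosen only for illustration)."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    seconds = time_call(sum_of_squares, 100_000)
    print(f"best of 5 runs: {seconds:.6f} s")
```

Keeping the minimum over several runs is one common way to reduce the influence of other activity on the machine. Measuring the correctness side of performance has no comparably simple recipe.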
Figure 1.1: Technologies and Computing Speed


Industry  demands  fast  machines  because  "time  is  money"  and  speed  alone 
can  make  difficult,  time-consuming  problems  tolerable.  Without  doubt,  the  speed 
of  a  processor  and  execution  time  are  important  performance  considerations.  But 
speed  is  partly  dependent  upon  technology.  Babbage's  designs  represented  quite  an 
advance,  but  they  could  not  be  realized  in  his  day.  Technology  can  determine  which 
designs succeed, and to what extent. Figure 1.1 compares several recent technologies
using speed (measured in operations per second) as the yardstick. The data for this
illustration was taken from Hayes [Ref. 2: p. 9]. As the figure indicates, it was nearly
a century after Babbage's work before major technological advances came about.

2.  Electricity 

Significant  gains  in  speed  were  made  possible  when  electricity  could  be  used 
in  computer  engineering.  The  United  States  census  of  1890  employed  punched  cards 
that were read using electricity and light. Herman Hollerith (1860-1929), the
designer of these cards, formed a company that would later join others and (in 1924)
take on the name International Business Machines Corporation. Punched paper tape
was later used by IBM in the Harvard Mark I, a general-purpose electromechanical
computer designed by Howard Aiken (1900-1973). In the late 1930s, at Iowa
State  University,  John  V.  Atanasoff  was  creating  a  special-purpose  machine  to  solve 
systems  of  linear  equations.  He  is  credited  with  "the  first  attempt  to  construct  an 
electronic  computer  using  vacuum  tubes"  [Ref.  2:    p.  16]. 

In 1943, at the University of Pennsylvania, J. Presper Eckert and John W. Mauchly
began directing the creation of "the first widely known general-purpose electronic
computer". The Electronic Numerical Integrator and Computer (ENIAC) project was
funded by the U.S. Army Ordnance Department. The 30-ton machine was completed
in 1946. It held more than 18,000 vacuum tubes. It could perform a ten-digit
multiplication in three milliseconds, three orders of magnitude faster than the
Harvard Mark I. [Ref. 2: pp. 17-18]

3.  First  Generation  Computers 

From Babbage's Analytical Engine to ENIAC, computer architectures held
data and programs in separate memories. In 1945, John von Neumann (1903-1957)
proposed the stored-program concept (i.e., programs and data could be stored in
the same memory unit). The Hungarian-born mathematician's involvement in the
ENIAC project is not remembered by many, but the "von Neumann architecture"
has become commonplace. In fact, it "has become synonymous with any computer
of conventional design independent of its date of introduction" [Ref. 2: p. 31].
Hennessy and Patterson [Ref. 3: pp. 23-24] object to the widespread use of this
term, claiming that Eckert and Mauchly deserved more of the credit.

In 1946, von Neumann (and others) began to design such an architecture
at the Institute for Advanced Study (IAS), Princeton. This machine, now called
the IAS computer, is representative of so-called first-generation computers (a label
that Hayes calls "a somewhat short-sighted view of computer history"). The IAS machine
was roughly ten times faster than ENIAC [Ref. 3: p. 24]. During the 1946-1948
timeframe,  A.  W.  Burks,  H.  H.  Goldstine,  and  John  von  Neumann  wrote  a  series  of 
reports  describing  the  IAS  design  and  programming.  The  advances  and  refinements 
in  computer  design  that  came  out  of  this  period  were  important  and  lasting.  By 
1950,  von  Neumann  and  his  colleagues  had  formed  a  foundation  of  theory  and  design 
worthy  of  advanced  technology.  [Ref.  2:    pp.  19-20] 

4.    Transistors 

The change from vacuum tube to transistor technology marked the beginning
of the "second generation" of computers (approximately 1955-1964). Transistor
technology provided faster switching elements, but this was not the only change
of the decade. Many of the plans of the late forties and early fifties involved memory,
so it was fitting that ferrite cores and magnetic drums be used for faster main
memories. Changes such as these led Hennessy and Patterson to conclude that "cheaper
computers" were the principal new product of the early 1960s [Ref. 3: p. 26].

Additionally, machines became more sophisticated. The space and
tasks of the central processing unit (CPU) and main memories were decentralized
with the advent of special-purpose processors to augment the CPU and
special-purpose memories (e.g., registers) to augment the main memory. Finally, system
software was becoming a greater issue. Programming continued moving upward,
away from the machine level, and the processing of batch jobs was becoming more
automated. [Ref. 2: pp. 31-32]

5.  Integrated  Circuits 

The first integrated circuit (IC) was introduced in 1961 [Ref. 4: p. 1], and
the use of ICs would be among the most significant advances evident in
third-generation computers (starting about 1965). Integrated circuits brought major
changes in cost, maintenance, reliability, and the amount of real estate required.
Other than these hardware improvements (circuits and memory), third-generation
computing was not easy to distinguish from that of the second generation. There was
some migration from hardware to software (e.g., microprogramming), CPUs became
more specialized and compartmentalized (e.g., pipelining), and system software continued
to advance (e.g., operating systems that could support multiprogramming through
"time-slicing"). [Ref. 2: p. 40]

6.  Instruction  Set  Trade-Offs

A  large  part  of  designing  computer  hardware  and  software  involves  analysis 
of  cost-performance  ratios.  Other  than  genuine  advances  in  design  or  technology, 
almost  every  aspect  of  computer  architecture  involves  trade-offs.  There  is  usually 
a  spectrum  of  options  from  which  the  computer  architect  chooses,  and  the  "best" 
solutions  are  not  always  found  near  the  ends  of  the  spectrum.  Performance  can  rarely 
be optimized with respect to both space and time, so a balance must be sought. This
space-time conflict, among others, appears when a designer must choose between a
sophisticated instruction set, a very simple one, or one of the many options in
between.


In the late 1970s and early 1980s both hardware and software became
progressively more sophisticated. Instructions became longer and more complex. The
Complex Instruction Set Computer (CISC) was popular. This design has the
advantage of powerful instructions, but the machine must decode each instruction
(each is a binary code). The decoding process favors brevity because longer instructions
require more levels of decoding circuitry. Nonetheless, if the longer instructions could
carry enough meaning, the decoding endeavor would be justified.

IBM researchers uncovered a provocative statistic: 20% of the instruction
set was carrying 80% of the burden [Ref. 5: p. 5]. The instruction set had become
too complex. With some help from several researchers and IBM, the Reduced
Instruction Set Computer (RISC) architecture became popular. RISC machines admit
a smaller vocabulary, but claim quicker comprehension. In fact, the goal of the RISC
architectures is one-cycle execution of the instructions [Ref. 5: pp. 6-7]. Hennessy
and Patterson, both key contributors to the RISC movement, give an indication of
the current broad acceptance of the RISC architecture [Ref. 3: p. 190]:

Prior  to  the  RISC  architecture  movement,  the  major  trend  had  been  highly 
microcoded  architectures  aimed  at  reducing  the  semantic  gap.  DEC,  with  the  VAX, 
and  Intel,  with  the  iAPX  432,  were  among  the  leaders  in  this  approach.  In  1989, 
DEC  and  Intel  both  announced  RISC  products — the  DECstation  3100  (based  on  the 
MIPS  Computer  Systems  R2000)  and  the  Intel  i860,  a  new  RISC  microprocessor. 
With  these  announcements,  RISC  technology  has  achieved  very  broad  acceptance. 
In  1990  it  is  hard  to  find  a  computer  company  without  a  RISC  product  either 
shipping  or  in  active  development. 

Three major research projects were central to early RISC developments. The first,
the IBM 801, began in the late 1970s under the direction of John Cocke. In 1980,
David Patterson and his colleagues at the University of California at Berkeley began
the RISC-I and RISC-II projects for which the architecture is named. Finally, John
Hennessy and others at Stanford University "published a description of the MIPS
machine" in 1981. [Ref. 3: p. 189]


7.    Multiprocessors  and  Multicomputers 

The most recent advances in the design of computing machinery include
parallel and concurrent architectures. The terminology associated with these
machines has been developing for about twenty-five years, but it is still immature.
The terms "multiprocessor" and "multicomputer", for instance, are sometimes used
with additional meaning. C. Gordon Bell proposes that a multiple-instruction,
multiple-data (MIMD) machine with message passing and no shared memory be called a
multicomputer. He calls a shared-memory MIMD machine a multiprocessor
[Ref. 6: p. 1092]. This terminology seems to be on the way to acceptance, and it seems
useful in giving a general characterization to many systems, but it lacks the sort of
precision that may be necessary.

First,  the  word  "computer"  usually  carries  many  expectations  with  it.  From 
a  computer,  we  expect  things  like  input  and  output  facilities,  peripheral  devices,  and 
so  on.  These  are  things  that  a  node  on  a  typical  "multicomputer"  does  not  always 
possess.  A  "processor"  is  just  the  opposite.  It  might  be  just  about  any  sort  of 
processor  and  we  are  cautious  about  attaching  any  expectations  to  the  term.  Many 
processors  are  special-purpose  machines,  but  (more  substantial)  central  processing 
units  and  arithmetic  logic  units  are  also  numbered  among  processors.  The  terms 
"computer"  and  "processor"  are  not  precise. 

Secondly, by automatically associating Flynn's taxonomy, memory models
(e.g., shared, distributed), and other things with a terminology, we reduce their
importance and hide them behind the term. By using the term "multicomputer"
without careful definition up front, we run the risk of forgetting that we are talking
about an MIMD machine that uses message passing and has no shared memory.
Additionally, this terminology, packed with expectations, ignores an entire spectrum
of very real possibilities. Are we saying that a machine cannot employ a combination
of shared and distributed memory? Using this terminology, how would we say that
the memory available to each node of a given system was 30 percent shared and 70
percent local (distributed)?

Nevertheless, the terms have some use, provided we don't expect too much
of them. After all, we distinguish cars from trucks in everyday conversation with
reasonably little confusion. But, in the same way that it is not prudent to assume
that "car" implies a vehicle equipped with a V-8 engine and four doors, we should
be careful to guard against packing too many specifics and expectations into the
terms "multiprocessor" and "multicomputer." For this reason, the terms
multiprocessor and multicomputer are used almost interchangeably in this work. A conscious
effort is made to support them with a clear description of the memory paradigm,
communications facilities, and so on.

Bell's terminology identifies the systems used in this work (iPSC/2 and
transputer networks) as multicomputers. Nevertheless, I often use the term
"multiprocessor" to identify a system with more than one processor (such as the ones
described in Chapter V and Appendix B). That is, multiprocessor means nothing
more than the expected combination of "multi" with "processor." To forestall
confusion, the rest of the thesis pertains to distributed memory machines that use message
passing to communicate instructions and data between nodes.

8.    Uniprocessors  and  Multiprocessors 

At the chip level, multiprocessor systems resemble their single-processor
predecessors. Experience (e.g., telephone industry, electronic technology) and a
foundation of theory and design (e.g., von Neumann's work, network theory) are distinct
benefits in the development of equipment and techniques for distributed and parallel
computing. From a system perspective, though, the concurrent use of more than one
processor creates a fundamentally different environment.


Uniprocessor systems differ substantially from multiprocessors and
multicomputers in their ability to access data without competition. In the presence of
more than one processor, regardless of memory model, there is a need to coordinate
requests for data. This means that the multicomputer must accommodate
interprocessor communications. The nodes of a multiprocessor system must work together
efficiently to justify the cost of the resulting system. Some parts of the solution are
relatively mature, but a vast territory (algorithms, electronic components, media
for communication, and software engineering techniques) begs further exploration.

B.   CURRENT  APPROACHES 
1.    Machines 

To compare the capabilities of different machines, some method of
benchmarking is typically used. By timing the execution of a given program (or set of
programs) on a given machine, we can determine its performance for that problem. By
comparing the execution times for the same problems on different machines, we arrive
at a notion of their relative power. A popular method for sizing up the computing
power of a machine is the LINPACK benchmarking program [Ref. 7]. This is essentially
a program that solves a dense system of linear equations.
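
The idea behind such a benchmark can be sketched in a few lines of Python. This is not the LINPACK code itself: it is a plain Gaussian elimination with partial pivoting on a random dense system, timed with the machine's clock, and the function names are hypothetical. The operation count 2n^3/3 + 2n^2 is the conventional figure used to convert the elapsed time of such a solve into a floating-point rate.

```python
import random
import time

def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a list of n rows (each a list of n floats); b is a list of n floats.
    Both are copied, so the caller's data is left untouched.
    """
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k to row k.
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[pivot] = a[pivot], a[k]
        b[k], b[pivot] = b[pivot], b[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

def benchmark(n, seed=0):
    """Time one solve of a random n-by-n system; return (seconds, flops per second)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    solve_dense(a, b)
    seconds = time.perf_counter() - start
    flops = 2.0 * n**3 / 3.0 + 2.0 * n**2    # conventional operation count
    return seconds, flops / seconds

if __name__ == "__main__":
    seconds, rate = benchmark(100)
    print(f"n=100: {seconds:.4f} s, {rate / 1e6:.2f} megaflops")
```

An interpreted sketch like this says little about the hardware underneath; it only shows the shape of the measurement: fix a problem, time its solution, and divide a known operation count by the elapsed time.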

Currently,  under  this  LINPACK  test,  the  fastest  machines  in  the  world 
have  surpassed  the  gigaflop  mark  (a  billion  floating-point  operations  per  second). 
Table  1.1,  adapted  from  Dongarra's  report  [Ref.  8:  p.  21],  shows  performance  data. 
The  leftmost  column  of  this  table  gives  the  name  of  the  system  and  the  cycle  time  (in 
parentheses).  The  next  column  contains  p,  the  number  of  processors  used  to  obtain 
the  data  that  is  shown  in  the  four  remaining  columns.  For  most  systems  (e.g.,  the 
Intel  iPSC/860)  the  size  of  the  system  (number  of  processors  used  for  a  given  run) 
can  be  scaled,  so  data  was  reported  for  several  different  system  sizes. 
TABLE 1.1: WORLD'S FASTEST COMPUTERS
(r_max and r_peak in gigaflops; n_max and n_1/2 are matrix dimensions)

Computer (Clock Rate)                    p    r_max    n_max    n_1/2   r_peak
Intel Delta (40 MHz)                   512     11.9    25000     7000       20
Thinking Machines CM-200 (10 MHz)     2048      9.0    28672    11264       20
Intel Delta (40 MHz)                   256      5.9    18000     5000       10
Thinking Machines CM-2 (7 MHz)        2048      5.2    26624    11000       14
Intel Delta (40 MHz)                   192      4.0    12000     4000      7.7
Intel Delta (40 MHz)                   128      3.0    12500     3500        5
Intel iPSC/860 (40 MHz)                128      1.9     8600     3000        5
nCUBE 2 (20 MHz)                      1024      1.9    21376     3193      2.4
Intel Delta (40 MHz)                    64      1.5     8000     3000      2.6
nCUBE 2 (20 MHz)                       512     .958    15200     2240      1.2
Intel iPSC/860 (40 MHz)                 64     .928     5750     2500      2.6
Fujitsu AP1000                         512    2.251    25600     2500      2.8
Intel iPSC/860 (40 MHz)                 32     .486     4000     1500      1.3
nCUBE 2 (20 MHz)                       256     .482    10784     1504      .64
MasPar MP-1 (80 ns)                  16384      .44     5504     1180      .58
Fujitsu AP1000                         256    1.162    18000     1600      1.4
Intel iPSC/860 (40 MHz)                 16     .258     3000     1000      .64
nCUBE 2 (20 MHz)                       128     .242     7776     1050      .32
Fujitsu AP1000                         128     .566    12800     1100      .71
Intel iPSC/860 (40 MHz)                  8     .132     2000      600      .32
nCUBE 2 (20 MHz)                        64     .121     5472      701      .15
Fujitsu AP1000                          64     .291    10000      648      .36
Intel iPSC/860 (40 MHz)                  4     .061     1000      400      .16
nCUBE 2 (20 MHz)                        32    .0611     3888      486     .075
Intel iPSC/860 (40 MHz)                  2     .044     1000      400      .08
nCUBE 2 (20 MHz)                        16    .0320     5580      342     .038
Intel iPSC/860 (40 MHz)                  1     .024      750               .04
nCUBE 2 (20 MHz)                         8    .0161     3960      241     .019
nCUBE 2 (20 MHz)                         4    .0080     2760      143    .0094
nCUBE 2 (20 MHz)                         2    .0040     1280       94    .0047
nCUBE 2 (20 MHz)                         1    .0020     1280       51    .0024

The column labeled r_max gives the performance (in gigaflops) for the largest
problem run on the machine. The size of that largest problem is indicated by n_max,
where n is the dimension of the matrix of coefficients, $A \in \mathbb{R}^{n \times n}$.
The n_1/2 column gives the problem size that yielded a rate of execution that was half
of r_max. Finally, r_peak denotes the theoretical peak performance (in gigaflops) for
the machine.
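
One way to read these columns together is to divide the measured rate by the theoretical peak. The short Python sketch below does this for a few rows copied from Table 1.1; the "efficiency" ratio is an illustrative derived quantity, not a column of Dongarra's report.

```python
def efficiency(r_max, r_peak):
    """Fraction of the theoretical peak achieved on the largest problem run."""
    return r_max / r_peak

# (system, processors p, r_max, r_peak), rates in gigaflops, copied from Table 1.1.
rows = [
    ("Intel Delta", 512, 11.9, 20.0),
    ("Thinking Machines CM-200", 2048, 9.0, 20.0),
    ("Intel iPSC/860", 128, 1.9, 5.0),
    ("nCUBE 2", 1024, 1.9, 2.4),
]

if __name__ == "__main__":
    for name, p, r_max, r_peak in rows:
        ratio = efficiency(r_max, r_peak)
        print(f"{name:26s} p={p:5d}  r_max/r_peak = {ratio:.3f}")
```

A machine with a modest peak can still rank well on this ratio, which is a reminder that r_peak alone says little about delivered performance.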

This data indicates that Intel is the current leader, among companies in
the United States, of the teraflop race, so we shall take a closer look at their
products. The Intel i860 microprocessor, together with 8 megabytes of memory, forms
one of 128 nodes in the hypercube-connected iPSC/860. This machine achieves
performance of nearly two gigaflops with LINPACK. iPSC stands for Intel Personal
Supercomputer, so this entry would not appear to target high-end markets. The
most significant project in supercomputing at Intel today is the Touchstone project.

George E. Brown, chairman of the U.S. House Committee on Science,
Space, and Technology, cut the ribbon around the Intel Touchstone Delta at the
California Institute of Technology on May 31, 1991 [Ref. 9: p. 96]. The Delta
is a mesh of 528 nodes. Each node holds an i860 processor and 16 megabytes of
memory. This machine has reached the 11.9 gigaflop mark with the LINPACK
benchmark. The closest competitor in the world would appear to be the CM-200
from Thinking Machines, Inc. This 2,048-node machine benchmarks at 9 gigaflops
[Ref. 8: p. 21]. The Touchstone program is not over. Intel plans to follow the Delta
with the Touchstone Sigma. Sigma will have at least 2,048 nodes, each consisting of
the i860 XP processor (about twice as powerful as the i860). [Ref. 9: p. 96]

The European high-performance computing market favors the transputer,
a microprocessor made by INMOS. The New York Times of May 31, 1991 lists one
German company, Parsytec, and seven American companies that have entered the
teraflop race: Bolt, Beranek, and Newman (BBN), Cray Research, IBM, Intel, nCUBE,
Thinking Machines, and Tera Computer [Ref. 10]. Parsytec expects their GC
to provide "the necessary 2 to 3 orders of magnitude increase in performance above
existing supercomputers to give scientists the tool to attack their Grand Challenges."
[Ref. 10: p. 1]

Parsytec envisions a system of up to 16,384 processing elements based upon
the INMOS T9000 transputer (see Chapter VII). This would give the Parsytec
machine 25-megaflop nodes capable of communications bandwidths near 100 megabytes
per second. The Parsytec design begins with a cluster of seventeen T9000 processors
(sixteen primary processors and the seventeenth for backup) and four C104
wormhole routing chips. From four clusters, the company will craft a GigaCube (or simply
Cube) of 64 processors (not counting redundant elements in the design). The GC-1
would represent a one gigaflop system and this would be the building block for
greater systems (lesser systems can initially be equipped with 16, 32, or 48 nodes).
The processors in a single (Giga)Cube are arranged in a three-dimensional (4 x 4 x 4)
grid. [Ref. 10]
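
The figures in this description can be checked with a little arithmetic. The Python sketch below simply multiplies the quantities quoted above (sixteen primary processors per cluster, four clusters per Cube, 25 megaflops per node); the resulting aggregate rates are nominal products computed here for illustration, not performance numbers reported by Parsytec.

```python
PRIMARY_PER_CLUSTER = 16    # the seventeenth T9000 in each cluster is a backup
CLUSTERS_PER_CUBE = 4
NODE_MEGAFLOPS = 25         # per-node rate quoted in the text
MAX_PROCESSORS = 16_384     # largest system envisioned

def cube_processors():
    """Primary processors in one (Giga)Cube."""
    return PRIMARY_PER_CLUSTER * CLUSTERS_PER_CUBE

def nominal_gigaflops(processors):
    """Aggregate nominal rate, assuming every node delivers NODE_MEGAFLOPS."""
    return processors * NODE_MEGAFLOPS / 1000.0

if __name__ == "__main__":
    per_cube = cube_processors()
    print("processors per Cube:", per_cube)                # matches a 4 x 4 x 4 grid
    print("nominal gigaflops per Cube:", nominal_gigaflops(per_cube))
    print("Cubes in the largest system:", MAX_PROCESSORS // per_cube)
    print("nominal gigaflops, largest system:", nominal_gigaflops(MAX_PROCESSORS))
```

Under these assumptions a single Cube has a nominal aggregate rate somewhat above the "one gigaflop" label, which is consistent with the label describing sustained rather than peak performance, although the source does not say so explicitly.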

2.    Programming  Practice 

Software  engineering  for  multiprocessor  systems  is  similar  to  contemporary 
practices  for  sequential  machines.  The  programming  languages  used  in  this  work 
provide  normal  C  libraries  with  additional  functions  to  accommodate  interprocessor 
communications.  The  systems  typically  provide  a  loader  designed  to  load  executable 
code  onto  the  (host  and)  nodes  according  to  the  programmer's  instructions.  Some 
loaders  require  that  the  same  code  be  loaded  onto  each  of  the  nodes.  Other,  more 
flexible,  loaders  allow  the  user  to  specify  which  program  should  be  loaded  onto  each 
node. The Logical Systems C network loader, LD-NET, is such a program. It takes
a  Network  Information  File  (NIF),  describing  the  network's  interconnections  and 
loading  instructions,  as  input  and  performs  the  loading  process. 
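
The programming model described here, ordinary sequential code plus a few communication calls, with a loader deciding which program runs on which node, can be imitated in a small Python sketch. Everything in it is an assumption made for illustration: threads stand in for nodes, queues stand in for communication links, and the `programs` dictionary plays the role of the loading instructions. It does not reproduce the LD-NET loader, the NIF format, or any vendor's communication library.

```python
import queue
import threading

def worker(node_id, inbox, mailboxes):
    """Node program: square each number received and send the result to node 0."""
    while True:
        msg = inbox.get()                 # blocking receive
        if msg is None:                   # sentinel: no more work for this node
            break
        mailboxes[0].put((node_id, msg * msg))

def run_network(values, n_workers=3):
    """Load a program onto each worker node, distribute values, collect on node 0."""
    n_nodes = n_workers + 1
    mailboxes = [queue.Queue() for _ in range(n_nodes)]    # one mailbox per node
    # "Loading": every worker node receives the same program here; a more
    # flexible loader could map a different function to each node number.
    programs = {node: worker for node in range(1, n_nodes)}
    threads = [
        threading.Thread(target=prog, args=(node, mailboxes[node], mailboxes))
        for node, prog in programs.items()
    ]
    for t in threads:
        t.start()
    # Node 0 (the host) sends work round-robin, then a stop message to each node.
    for i, v in enumerate(values):
        mailboxes[1 + i % n_workers].put(v)
    for node in range(1, n_nodes):
        mailboxes[node].put(None)
    results = [mailboxes[0].get()[1] for _ in values]
    for t in threads:
        t.join()
    return sorted(results)

if __name__ == "__main__":
    print(run_network([1, 2, 3, 4, 5]))
```

Because threads share one address space, this is only an analogy for a distributed-memory machine; the point is that the node programs exchange data solely through explicit send and receive operations.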
C.  THE  FUTURE
1.    Crossroads 

Parallel  and  distributed  computing  is  in  the  early  years  of  a  very  promising 
lifetime.  We  should  give  careful  consideration  to  the  direction  that  the  field  should 
assume.  Lacking  years  of  experience,  I  will  lean  on  the  writings  and  advice  of  others 
while trying to peer a little way into the future of parallel computing. A regrettable
side  effect  of  this  decision  is  that  this  section  seems  to  consist  primarily  of  the 
observations  and  opinions  of  others.  Notwithstanding  the  many  quotations,  I  believe 
that  several  important  ideas  are  exposed. 

This  business  is  filled  with  a  combination  of  old,  established  ideas  and 
proven techniques. It also holds new questions and opportunities. Hamming's
advice [Ref. 11: p. 14] seems most fitting in this situation:

Now I see constantly attempts to force new ideas to old molds. That is
frequently sensible: How can I make sense of what I'm seeing compared to what I did
before? But also one must ask, "Am I seeing something fundamentally new?" That
part many people will not try. You cannot afford to make everything brand new and
not connect anything together with existing ideas, nor can you try to make
everything fit into preconceived categories. Some combination of the two is necessary.

We limped through the transistor revolution and the computer revolution,
which are connected with the bandwidth revolution; they are all connected together. . .
You have to abandon old ideas when you get an order of magnitude of change. . . .

-  RICHARD  W.  HAMMING 

Developments in scientific computing today make Dr. Hamming's thoughts
especially timely. The field needs to establish a strategy: a direction that will lead
from its present immaturity to the fulfillment of its potential. Kenneth Wilson
proposes Grand Challenges for computational science that may help to establish this
strategy [Ref. 12].
2.    Grand  Challenges

Wilson identifies three modes of scientific activity: theoretical,
experimental, and computational. He defines these areas, claiming that, with today's
supercomputers, the most recent mode (computational) is becoming more
significant. So significant, in fact, that "long experience or professional training is required
to be successful in computational science at the supercomputer level, making it
appropriate to think of computational science as both a separate mode of scientific
endeavor and new discipline." [Ref. 12: p. 172]

Wilson is careful to distinguish computational science from computer
science. He defines computer science as the business of addressing "generic intellectual
challenges of the computer itself" and characterizes computational science as being
tailored to specific applications areas (with serious training in the application
discipline) [Ref. 12: p. 172]. To advance computational science, Wilson recommends a
quantitative approach with clear strategies [Ref. 12: p. 173]:

The major future opportunities for benefits of supercomputers to basic
research should be identified without the existing compromises, but presented as
challenges to be overcome with the many obstacles to success clearly explained. The
compromises and inadequacies of current computations need to be described and
the level of advances required to overcome these inadequacies discussed.
Furthermore, a few key areas with both extreme difficulties and extraordinary rewards for
success should be labelled as the "Grand Challenges of Computational Science".
Two examples are electronic structure and turbulence. No easy promises of success
in Grand Challenges should be offered. Instead, computational scientists should be
building plans to assault the Grand Challenges, pushing for the major advances
in algorithms, software, and technology that will be required for true progress to
be achieved in these areas. The Grand Challenges should define opportunities to
open up vast new domains of scientific research, domains that are inaccessible to
traditional experimental or theoretical modes of investigation.

Wilson describes a few examples that demonstrate the limitations of experimental
instrumentation and the potential of supercomputers. Weather prediction,
astronomy, materials science, molecular biology, aerodynamics, and quantum field
theory are the six areas that Wilson chooses to make his point. He describes these
areas in reasonable detail and briefly mentions other topics. [Ref. 12: pp. 175-179]

a.  Mathematical  Background 

Wilson  stresses  the  need  for  sound  design  practices  and  good  algorithms. 
(To see why, consider Table A.1.) Additionally, he warns that we should spend less
time  in  awe  of  today's  supercomputing  power  and  admit  that  it  is  terribly  inadequate. 
Modeling  methods  and  sound  mathematical  background  also  appear  in  the  "needs 
improvement"  category.  Wilson  [Ref.  12:    p.  180]  believes  that 

Mathematical  developments  that  relate  to  numerical  computation  are  highly 
important.  Theorems  about  numerical  errors  or  sources  of  error,  exact  solutions 
and expansions, existence and uniqueness proofs and the like, can make a major difference
in establishing the credibility of a numerical computation. All too frequently
there  is  too  little  mathematical  understanding  backing  up  numerical  simulation. 

b.  Issues  of  Quality 

Wilson does not consider these to be the only problems facing computational
scientists. He believes that quality is endangered, primarily from two
directions  [Ref.  12:    pp.  180-181]: 

•  A  tendency  to  stay  on  the  safe,  easy  side;  not  wandering  far  from  the  position: 
"our  calculation  agrees  with  experiment." 

•  The  quality  of  computational  programs,  measured  against  practical  criteria, 
is lacking. The standards include rounding errors (e.g., catastrophic cancellation),
overflows, and stability (with respect to input parameters).
c.    Languages 

Wilson  cites  a  number  of  reasons  for  revolutions  in  computer  languages. 
In  particular,  he  believes  that  "Fortran  is  in  the  long-term  the  most  fundamental 
barrier  to  progress"  [Ref.  12:  p.  182].  His  approach  is  realistic  enough  to  recognize 
the  vast  investments  of  scientific  communities  in  Fortran.  The  language  cannot  and 
should  not  be  eliminated  in  a  day.  Nevertheless,  it  has  very  serious  shortcomings. 
Some  problems  could  be  overcome  by  a  Fortran  preprocessor  (the  same  idea  as  the  C 
preprocessor).  Other  problems,  like  lack  of  support  for  abstraction  and  the  unnatural 
exclusion  of  basic  mathematical  symbols  in  the  language,  are  not  solved  as  easily. 
[Ref.  12:    p.  182] 

Wilson  does  not  recommend  a  simple  change  of  language  as  the  solution, 
but  searches  for  deeper  problems.  He  believes  that  the  entire  way  that  computational 
scientists  and  programmers  think  about  and  plan  programs  must  change  as  well. 
The basic impression left by Wilson's analysis of language problems is that
we have an urgent need for general-purpose practices to replace
patchwork, hit-or-miss, case-by-case solutions.

3.    Generality 

David Harel is also an advocate of the need for general-purpose techniques.
In the preface to his book [Ref. 13: p. viii] he warns:

Curiously, there appears to be very little written material devoted to the science
of computing and aimed at the technically oriented general reader as well as
the professional. This fact is doubly curious in view of the abundance of precisely
this kind of literature in most other scientific areas, such as physics, biology,
chemistry and mathematics, not to mention humanities and the arts. There appears to
be an acute need for a technically detailed, expository account of the fundamentals
of computer science; one that suffers as little as possible from the bit/byte or
semicolon syndromes and their derivatives, one that transcends the technological
and linguistic whirlpool of specifics, and one that is useful both to a sophisticated
layperson and to a computer expert. It seems that we have all been too busy with
the revolution to be bothered with satisfying such a need.


This idea is not unique. One of the other major proponents of general-purpose
parallel computing is David May of INMOS. In an invited lecture at the
Transputing '91 conference [Ref. 14], he highlighted features that general-purpose
parallel hardware should deliver. Among the important components of a general
approach, May included the following:

•  Scaling. Performance must scale with the number of processors. Efficiency is
partly dependent on problem size, but, with adequate problem size, systems
of a thousand processors should be within technological reach. Each processor
is expected to achieve $10^8$ to $10^9$ flops.

•  Portability. This is almost synonymous with "general purpose." May emphasizes
algorithms based upon features common to many machines, and which
remain valid as technology evolves. He stresses that this general purpose parallel
architecture will benefit both the computer designer and the programmer.
The designer will gain since the market will be somewhat predictable. The
programmer's code will work on several machines, with a strong hope that it
will keep working in future years.

To  achieve  these  goals,  May  proposes  several  guidelines.  First,  for  a  message  passing 
system  using  p  processors,  the  nodes  must  be  capable  of  concurrent  computing  and 
communication.  The  interconnection  topology  must  provide  scalable  throughput 
(linear  in  p)  and  bounded  delay,  probably  log(p).  Programs,  May  believes,  should  be 
written  at  as  high  a  level  as  possible  and  make  use  of  many  processes.  The  algorithm 
should  express  the  maximum  possible  parallelism.  Much  of  May's  theory  is  based 
upon  the  structure  of  a  hypercube  interconnection  topology  (or  virtual  hypercube). 
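
The logarithmic delay bound mentioned above can be illustrated with a short C sketch. It assumes the usual hypercube labelling, in which each of the p = 2^d nodes carries a d-bit label and two nodes are linked when their labels differ in exactly one bit. The function names are hypothetical and are not taken from May's lecture or from the thesis code.

```c
/* Neighbor of `node` across dimension k of a hypercube:
   flip bit k of the node label. */
unsigned hypercube_neighbor(unsigned node, unsigned k)
{
    return node ^ (1u << k);
}

/* Number of links on a shortest path between nodes a and b:
   the count of bit positions in which the labels differ
   (the Hamming distance). It never exceeds d = log2(p). */
unsigned hypercube_hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    unsigned hops = 0;

    while (diff != 0u) {
        hops += diff & 1u;   /* count one differing bit */
        diff >>= 1;
    }
    return hops;
}
```

For a 1,024-node cube (d = 10), the two most distant nodes, labelled 0 and 1023, are ten links apart under this labelling.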
4.    Projections 

Kenneth Wilson argues credibly that parallel computing is here to stay.
His reasoning is based upon the fact that mass production and heavy
competition are proven ingredients in keeping the cost of chips low. Rather than
summarize, I will quote his conclusion [Ref. 12: p. 185]:

Today a single processing unit costing millions of dollars can still be cost-effective
but I don't think this can last very long, over a period of time (I cannot
estimate  how  many  years)  it  seems  likely  that  the  maximum  price  of  a  cost-effective 
processor  will  plunge  to  one  hundred  thousand  dollars,  to  ten  thousand  dollars,  to 
???.  I  cannot  estimate  the  ultimate  equilibrium  price  at  which  this  plunge  will  stop. 

Meanwhile  I  can  find  no  prospects  that  single  supercomputer  processors  speeds 
will  advance  at  anything  like  the  pace  at  which  processor  costs  are  being  reduced, 
even  using  Gallium  Arsenide  or  superconducting  Josephson  junctions. 

The  result  of  this  is  inevitable — overall  advances  at  the  supercomputer  level 
have  to  come  through  parallelism,  namely,  big  increases  in  speed  have  to  come  from 
the  simultaneous  use  of  many  processors  in  parallel. 

David May agrees with Wilson, stating that increasingly complex components
and faster clock speeds are not likely avenues of advancement. This makes
parallel processing "technically attractive." May also agrees that mass production will
make the most effective use of design and production facilities. His conclusion: "A
general purpose parallel architecture would allow cheap, standard multiprocessors to
become pervasive." [Ref. 14]

May's  prediction  for  1995  includes  processors  capable  of  100  megaflops. 
INMOS  believes  strongly  in  the  idea  of  balancing  computation  and  communication, 
and  May  projects  that  node  throughputs  will  have  reached  500  megabytes  per  second. 
In  1995's  multiprocessor  systems,  he  envisions  teraflop  performance.  By  2000,  May 
projects  "scalable  general  purpose  parallel  computers  will  cover  the  performance 
range up to $10^{11}$ flops. Specialised parallel computers will extend this to $10^{13}$ flops."
D.   OVERVIEW 

This chapter has surveyed the (relatively recent) history of computing, considered
the state-of-the-art, and made a few guesses as to the future. Additionally, it
has  introduced  numerical  and  parallel  computing.  This  serves  as  a  backdrop  for  the 
remainder  of  the  thesis.  Chapter  II  expands  the  background  on  parallel  processing 
and  numerical  methods.  The  latter  provides  a  lead-in  to  the  specific  algorithms  and 
theory  that  appear  in  Chapter  III.  Chapter  IV  introduces  the  parallel  design  and 
methods  used  in  the  work.  A  description  of  the  environment,  tools,  and  equipment 
appears  in  Chapter  V.  Results  and  conclusions  appear  in  Chapters  VI  and  VII. 

Appendices are provided to keep the chapters concise and focused. The appendix
material operates on both sides of that focus. Some of the material is designed
to give sufficient background and the rest (code mostly) is provided for more
in-depth study. The background material may be obvious to some readers and new
to  others.  I  have  assumed  that  the  reader  has  some  knowledge  of  the  background 
material.  I  do  not  presume  that  the  reader  will  be  familiar  with  the  code. 

To  simplify  the  discussion  we  must  speak  the  same  language.  Appendix  A 
gives  the  basic  terms  and  notation  used  in  the  rest  of  the  thesis.  Next,  we  discuss 
the  machines  used  to  perform  the  work.  While  this  is  the  subject  of  Chapter  V,  a 
more  detailed  account  is  reserved  for  Appendix  B.  Appendix  C  provides  a  general 
background  on  interconnection  topologies.  Emphasis  is  placed  upon  the  hypercube 
connection  scheme.  Appendix  D  describes  the  process  whereby  a  real-world  problem 
is  translated  into  matrix  notation.  Appendix  E  gives  some  information  and  results  for 
communications  performance  in  a  hypercube.  Finally,  Appendix  F  provides  listings 
for  most  of  the  code  used  in  the  research. 
II.    BACKGROUND 

Mathematics  is  the  door  and  key  to  the  sciences. 

-  ROGER  BACON 

Chapter I provided a backdrop, showing the state of scientific computing, especially
parallel and distributed forms, today. In the present chapter, the scope
is  limited  to  material  and  equipment  pertaining  to  this  research.  The  thesis  work 
deals  with  methods  of  conjugate  directions  implemented  upon  two  contemporary 
MIMD  machines.  The  goal  is  to  introduce  the  theory,  machines,  methods,  and  a  few 
peripheral  issues  that  will  be  helpful  as  background  information. 

A.   COMPUTING  WITH  REAL  NUMBERS 

As illustrated in Figure 1.1, the speed of computing machinery has risen swiftly
since the 1940s. This has often been encouraged by substantial advances in technology.
Today's multiprocessor machines seem to be maintaining the fast-paced
growth. Additionally, although precision is a less glamorous business than speed,
the accuracy of machine solutions has become more standardized. This section considers
some of the principal issues of computing with finite approximations of real numbers.

We have observed that the history of computing shows close ties to science and
mathematics. As the design and construction of computers becomes a more specialized
business (mostly performed by electrical and computer engineers), we still
find that many of the fundamental requirements are related to scientific problems.
These problems typically involve mathematics, and a significant amount of scientific
computing applies numerical methods that involve real numbers. The trend in computer
(hardware and software) design is toward abstraction, but from time to time
we absolutely must understand and work with the underlying, concrete principles.

1.    Finite-Precision 

New  problems  are  generated  as  the  speed  of  computing  machinery  improves 
with  each  generation  of  machines.  One  question  to  be  considered  is,  how  reliable 
are  the  machines  and  the  software  that  runs  on  them?  This  is  a  constant  concern 
in  computing.  Many  scientific  problems  involve  continuous  phenomena  in  the  real 
world. Accordingly, we would like to be able to represent the real numbers, $\mathbb{R}$, within the
machine. But, lacking infinite storage, this is impossible. There have been several
more-or-less  reasonable  ideas  and  implementations  of  approximations  to  the  real 
numbers  within  the  limits  of  computer  storage.  Of  these,  the  floating-point  concept 
of  storage  and  arithmetic  enjoys  the  most  widespread  use. 

The  Institute  of  Electrical  and  Electronics  Engineers  (IEEE)  has  established 
the  principal  standards  for  floating-point  representations  and  arithmetic.  These 
standards make machine arithmetic more predictable. Surprisingly, while they exist
in much of today's computing hardware, the standards are not widely understood by
practitioners. As a result, software and applications are sometimes developed in
ignorance of them. The title of David Goldberg's paper [Ref. 15] speaks volumes:
"What Every Computer Scientist Should Know About Floating-Point Arithmetic."
Goldberg is also responsible for several other contributions describing floating-point
arithmetic and the IEEE standards. Appendix A of Hennessy and Patterson's book on
architecture [Ref. 3] is one such contribution. In it, Goldberg gives a very useful
description of the IEEE standards and instruction on how to perform arithmetic
operations on machines that adhere to them.
2.   IEEE  754 

Of  the  four  precisions  specified  by  the  IEEE  754-1985  standard,  this  thesis 
uses  the  double  precision  format  most  often  (to  approximate  real  numbers)  so  it 
will  receive  the  most  attention.  In  the  C  programming  language,  these  numbers 
correspond  to  the  type  double.  They  are  floating-point  values  stored  in  eight  bytes 
(64  bits).  The  storage  representation  is  illustrated  as  three  components:  one  sign  bit, 
s; an 11-bit exponent, e; and a 52-bit fraction, f. Figure 2.1 shows an example. We
say that e is a biased exponent. Both negative and positive exponents are stored using
a range of positive binary numbers biased about (nearly) the middle; the bias is 1023.
Significand or mantissa is the name given to the number (1.f). The fraction is a
packed form of the significand. This means that the leading one of the significand is
implicit. Such a number is called a normalized number. [Ref. 16]

All IEEE floating-point numbers are normalized except for the special representations
when e = 00000000000 = 0 or e = 11111111111 = 2047. When e = 0, the value is zero or a
denormalized (or subnormal) number; when e = 2047, it is an infinity or a NaN. Only the
fraction, f, of a normalized number is stored [Ref. 3: p. A-14]. Figure 2.1 shows a
representation of the floating-point number, x = 7.0. First, x is shown as it would be
defined in a C program. The C address-of operator, &x, is used to indicate the address
of x in memory. That is, somewhere (namely &x) in memory, there are eight contiguous
bytes that hold a floating-point representation of x and (for illustration purposes) we
can imagine the IEEE 754 double-precision representation of x as Figure 2.1 indicates.
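
The three fields can also be inspected directly. The following is a minimal sketch, assuming a platform where double is the 64-bit IEEE 754 format and where integers and doubles share the same byte order. The type ieee754_fields and the function decode_double are hypothetical names chosen for illustration; they are not the num.sys routines of Appendix F.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a double-precision number. */
typedef struct {
    unsigned sign;       /* s: one bit                     */
    unsigned exponent;   /* e: 11-bit biased exponent      */
    uint64_t fraction;   /* f: 52 stored fraction bits     */
} ieee754_fields;

/* Split x into s, e and f. Assumes sizeof(double) == 8. */
ieee754_fields decode_double(double x)
{
    uint64_t bits;
    ieee754_fields out;

    /* Copy the eight bytes at &x into an integer of the same size. */
    memcpy(&bits, &x, sizeof bits);

    out.sign     = (unsigned)(bits >> 63);
    out.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    out.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return out;
}
```

For x = 7.0 this yields s = 0, e = 1025, and a fraction whose two leading bits are set, which agrees with Figure 2.1. Working on the integer value rather than on individual bytes keeps the sketch independent of the order in which the bytes are laid out in memory.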

A standard, such as IEEE 754 (and the lesser-known IEEE 854), is not a
panacea for the finite-precision problem but it lends tremendous support to those
who would scientifically deal with the problems of finite-precision arithmetic.
Programs given in the files num.sys.h and num.sys.c (in Appendix F) are of interest
to those who would explore further. The programs can demonstrate that the actual
order and location of bits in memory may not match the representation of
Figure 2.1. This reflects practicalities concerning storage and transmission of bytes at
a very low level in the machine. It is perfectly reasonable (and easier) to use the
common abstraction of Figure 2.1 regardless of machine implementation.

    double x = 7.0;

    0 | 10000000001 | 1100000000000000000000000000000000000000000000000000
    s   e = 1025      f = .11 (binary)

Interpretation:

$$
\begin{aligned}
x &= (-1)^s \times 1.f_2 \times 2^{e-1023} \\
  &= (-1)^0 \times 1.11_2 \times 2^{1025-1023} \\
  &= 1.11_2 \times 4 \\
  &= 111_2 \\
  &= 7
\end{aligned}
$$

Figure 2.1: IEEE 754 Representation: Double Precision

B.   NUMERICAL  ISSUES 
1.   The  Need 

Consider the problem of determining the area under a bounded function
f(x) over a closed interval [a, b]. Numerical quadrature (integration) rules such as
the Trapezoidal Rule or Simpson's Rule are used to arrive at an approximating (or
Riemann) sum of many smaller areas within the region. Numerical methods are
often used to approximate the solution to a problem. This is no trivial problem. To
solve it (numerically) by anything other than accident, one must first understand
the theory and analytical approach. Next, the problem can be translated into an
algorithm (a plan, usually mathematical in nature, for solving the problem step
by step) which can, in turn, be translated into the sort of language that a machine
understands.
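
As an illustration of that last translation step, here is a minimal C sketch of the composite Trapezoidal Rule. The function names and the sample integrand f(x) = x^2 are assumptions made for this example; they do not come from the thesis code.

```c
#include <stddef.h>

/* Composite Trapezoidal Rule: approximate the area under f over
   [a, b] using n >= 1 panels of equal width h = (b - a) / n. */
double trapezoid(double (*f)(double), double a, double b, size_t n)
{
    double h = (b - a) / (double)n;
    double sum = 0.5 * (f(a) + f(b));   /* end points carry weight 1/2 */

    for (size_t i = 1; i < n; ++i)
        sum += f(a + (double)i * h);    /* interior points carry weight 1 */

    return h * sum;
}

/* Sample integrand (hypothetical): f(x) = x^2, whose exact
   integral over [0, 1] is 1/3. */
double square(double x)
{
    return x * x;
}
```

Calling trapezoid(square, 0.0, 1.0, 1000) gives a value close to 1/3. The small remaining difference comes from stopping at a finite number of panels, not from a mistake in the program.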

This is a relatively simple approximation problem compared to the problem
of finding the solution to a system of 500 equations in 500 unknowns. Consider the
(perhaps more realistic) problem of using numerical linear algebra to solve an elliptic
partial differential equation like the one presented in Appendix D. Numerical concerns
abound in problems such as these. Additionally, many problems in numerical
linear algebra have time complexities of $\Theta(n^2)$ or $O(n^3)$ and storage requirements of
$O(n^2)$, so speed is essential. (Appendix A reviews complexity notation such as
big-Oh and big-Theta.)

2.    Errors  and  Blunders 

A clear understanding of the difference between errors and blunders is
important, since recognizing the source of an error is prerequisite to eliminating or
reducing it. The terms are introduced in [Ref. 17: p. 1]:

Blunders  result  from  fallibility,  errors  from  finitude.  Blunders  will  not  be 
considered  here  to  any  extent.  There  are  fairly  obvious  ways  to  guard  against  them, 
and their effect, when they occur, can be gross, insignificant, or anywhere in between.
Generally the sources of error other than blunders will leave a limited range
of  uncertainty,  and  generally  this  can  be  reduced,  if  necessary,  by  additional  labor. 
It  is  important  to  be  able  to  estimate  the  extent  of  the  range  of  uncertainty. 

—  ALSTON  S.  HOUSEHOLDER 
3.   The  Issues 

To anticipate (or even troubleshoot) error, we must know where it
comes from. In [Ref. 17: p. 2], Alston Householder lists the four sources of error that
were set forth by John von Neumann and Herman Goldstine:

•  Mathematical  formulations  are  seldom  exactly  descriptive  of  any  real  situation, 
but  only  of  more  or  less  idealized  models.  Perfect  gases  and  material  points  do 
not  exist. 

•  Most  mathematical  formulations  contain  parameters,  such  as  lengths,  times, 
masses,  temperatures,  etc.,  whose  values  can  be  had  only  from  measurement. 
Such  measurements  may  be  accurate  to  within  1,  0.1,  or  0.01  percent,  or  better, 
but  however  small  the  limit  of  error,  it  is  not  zero. 

•  Many  mathematical  equations  have  solutions  that  can  be  constructed  only  in 
the  sense  that  an  infinite  process  can  be  described  whose  limit  is  the  solution 
in  question.  By  definition  the  infinite  process  cannot  be  completed.  So  one 
must  stop  with  some  term  in  the  sequence,  accepting  this  as  the  adequate 
approximation  to  the  required  solution.  This  results  in  a  type  of  error  called 
the  truncation  error. 

•  The  decimal  representation  of  a  number  is  made  by  writing  a  sequence  of  digits 
to  the  left,  and  one  to  the  right,  of  an  origin  which  is  marked  by  a  decimal 
point.  The  digits  to  the  left  of  the  decimal  point  are  finite  in  number  and 
are  understood  to  represent  coefficients  of  decreasing  powers  of  10.  In  digital 
computation  only  a  finite  number  of  these  digits  can  be  taken  account  of.  The 
error  due  to  dropping  the  others  is  called  the  round-off  error.  .  .  . 
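
The last two sources, truncation and round-off, can be seen side by side in a small C sketch. It sums the series e = 1/0! + 1/1! + 1/2! + ... and stops after a chosen number of terms. The function names and the reference constant are assumptions made for this illustration; the example is not taken from Householder.

```c
/* Partial sum of e = 1/0! + 1/1! + 1/2! + ... using `terms` terms.
   Stopping after finitely many terms causes the truncation error. */
double e_partial_sum(int terms)
{
    double sum = 0.0;
    double term = 1.0;               /* holds 1/k!, starting at k = 0 */

    for (int k = 0; k < terms; ++k) {
        sum += term;
        term /= (double)(k + 1);     /* advance 1/k! to 1/(k+1)! */
    }
    return sum;
}

/* Absolute difference between the partial sum and a reference
   value of e that is itself rounded to double precision. */
double e_error(int terms)
{
    const double e_ref = 2.718281828459045;
    double d = e_ref - e_partial_sum(terms);

    return d < 0.0 ? -d : d;
}
```

With five terms the difference is about one part in a hundred, almost all of it truncation error. With twenty terms the truncation error is far smaller than the spacing between adjacent doubles near e, so what remains is round-off from the finite number of digits carried in each addition and division.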
C.  MACHINE  METHODS 

We would like to somehow characterize the techniques that make a problem-solving
method "good". The abilities of machines and people are distinct enough that
we should not always expect an algorithm for machine solution to mirror the
pencil-and-paper method of an individual. Hestenes and Stiefel make this distinction,
defining a hand method as "one in which a desk calculator may be used" and a machine
method as "one in which sequence-controlled machines are used." [Ref. 18: p. 409]
Further, in the same reference, they list the following characteristics that a good
machine method exhibits:

(1)  The  method  should  be  simple,  composed  of  a  repetition  of  elementary 
routines  requiring  a  minimum  of  storage  space. 

(2)  The  method  should  insure  rapid  convergence  if  the  number  of  steps  re- 
quired for  the  solution  is  infinite.  A  method  which — if  no  rounding-off  errors 
occur — will  yield  the  solution  in  a  finite  number  of  steps  is  to  be  preferred. 

(3)  The  procedure  should  be  stable  with  respect  to  rounding-off  errors.  If 
needed,  a  subroutine  should  be  available  to  insure  this  stability.  It  should  be  possible 
to  diminish  rounding-off  errors  by  a  repetition  of  the  same  routine,  starting  with 
the  previous  result  as  the  new  estimate  of  the  solution. 

(4)  Each  step  should  give  information  about  the  solution  and  should  yield  a 
new  and  better  estimate  than  the  previous  one. 

(5)  As  many  of  the  original  data  as  possible  should  be  used  during  each  step 
of  the  routine.  Special  properties  of  the  given  linear  system — such  as  having  many 
vanishing  coefficients — should  be  preserved.  (For  example,  in  the  Gauss  elimination 
special  properties  of  this  type  may  be  destroyed.) 

D.  CONJUGATE  DIRECTIONS 

Hestenes and Stiefel describe the method of conjugate directions (CD). This is
a general approach to solving systems of linear equations that uses direction vectors,
$p_0, p_1, \ldots$, to determine how the search for a solution should proceed from step
to step. When the method for determining these vectors is defined, CD becomes a
specific method. There are at least two of these specific methods within CD that
are especially suited to computer implementation: Gauss factorization (GF) and the
method of conjugate gradients (CG). [Ref. 18: p. 412]

The term conjugate is clearly an important one for these methods. Given a
matrix $A \in \mathbb{R}^{n \times n}$ that is symmetric, we say that two vectors x and y are conjugate
if

$$x^T A y = (Ax)^T y = 0. \qquad (2.1)$$

There is an alternative term that emphasizes the role of A in this definition. We also
say that x and y are A-orthogonal. [Ref. 18: p. 410]
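
Definition (2.1) translates directly into a numerical test. The sketch below assumes A is stored as a dense n-by-n array in row-major order and treats two vectors as conjugate when the magnitude of x^T A y falls below a caller-supplied tolerance. The function names and the tolerance parameter are hypothetical and are not part of the thesis code.

```c
#include <stddef.h>

/* Compute x^T A y for an n-by-n matrix A stored in row-major order. */
double bilinear_form(const double *A, const double *x,
                     const double *y, size_t n)
{
    double total = 0.0;

    for (size_t i = 0; i < n; ++i) {
        double row = 0.0;                 /* component i of A y */
        for (size_t j = 0; j < n; ++j)
            row += A[i * n + j] * y[j];
        total += x[i] * row;
    }
    return total;
}

/* Nonzero when x and y are conjugate (A-orthogonal) to within tol.
   In finite precision an exact zero cannot be expected in general. */
int is_conjugate(const double *A, const double *x,
                 const double *y, size_t n, double tol)
{
    double v = bilinear_form(A, x, y, n);

    return (v < 0.0 ? -v : v) <= tol;
}
```

For the symmetric matrix with rows (2, 1) and (1, 2), the vectors x = (1, 0) and y = (1, -2) are conjugate, while x and the axis vector (0, 1) are not, since x^T A (0, 1)^T = 1.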

The method of conjugate gradients chooses its direction vectors, $p_i$, to be mutually
conjugate ($p_i^T A p_j = 0$ whenever $i \ne j$) and in such a manner that $p_{i+1}$ depends
upon $p_i$ (a specific formula is given near the end of Chapter III). The Gauss
factorization chooses $p_i = e_i$, the ith axis vector. [Ref. 18: pp. 412, 425-427]

In  this  research,  the  Gauss  method  gets  almost  all  of  the  attention,  but  the 
method  of  conjugate  gradients  receives  a  short  overview  near  the  end  of  Chapter  III. 
The  theory  of  conjugate  directions  is  not  at  all  trivial,  and  the  ties  of  Gauss  and 
conjugate  gradients  to  conjugate  directions  are  fairly  deep.  These  issues  are  covered 
in  the  work  of  Hestenes  and  Stiefel  [Ref.  18].  This  thesis  develops  the  Gauss  method 
from  an  implementation  standpoint. 

E.   PARALLEL  PROCESSING 

The  field  of  parallel  and  distributed  computing  is  a  relatively  new  one.  In 
one  sense,  it  is  quite  natural.  We  perform  work  in  parallel  every  day.  In  fact,  a 
manager-worker  notion  is  a  very  useful  means  to  understand  the  issues  of  this  field. 
The  programs  developed  in  this  research  involve  a  host  or  manager  and  nodes  or 
workers.  This  is  often  called  the  workfarm  approach. 
The principal "problem" in parallel computing is communication. Appendix C
relates some of the considerations. Of course, there are other concerns as well: load
balancing, problem size (granularity), and so on. These issues, as they apply to
this research, are discussed in Chapter IV.

The bottom line, after all of the design and implementation work, is performance.
With a multicomputer, as in a workfarm, we are after efficiency so that more
computing can be done in a shorter time and for less money. Bell is even more
specific. He believes the multicomputer must offer two key facilities to become
established [Ref. 6: p. 1097]:

•  Power  that  is  not  otherwise  available. 

•  Performance  for  a  price  that  is  "at  least  an  order  of  magnitude  cheaper  than 
traditional  supercomputers." 

In  Chapter  VI,  we  consider  results  obtained  upon  two  contemporary  parallel 
machines.  This  information  helps  us  to  evaluate  the  potential  of  MIMD  architectures 
in  terms  of  Bell's  criteria. 

F.   SPEEDUP 

The  terms  speedup  and  efficiency,  defined  in  Appendix  A,  capture  most  of  the 
interest  when  we  talk  about  the  potential  of  parallel  computing.  The  principal  reason 
for  choosing  a  multicomputer  over  a  single  computer  is  speed.  Therefore,  we  are  most 
interested  in  knowing  what  kind  of  speed  we  can  obtain  from  a  multiprocessor  system. 
Bell's  comments  on  price  are  germane  as  well. 
Speedup and efficiency are both machine dependent and problem dependent.
Some problems should not be executed on a parallel machine! Suppose, for instance,
that part of a problem must be performed sequentially. Amdahl's law is a well-known
attempt to characterize this problem. Amdahl stated that speedup on P processors,
S, is limited in the following manner:

$$S \le \frac{1}{f + (1 - f)/P} \qquad (2.2)$$

where f is "the fraction of operations in a computation that must be performed
sequentially, where $0 \le f \le 1$" [Ref. 19: p. 19]. With speedup, S, defined as
in (2.2) we see that

$$\lim_{P \to \infty} S = \frac{1}{f} \qquad (2.3)$$

Figure 2.2 shows how this limit begins to take effect as the number of processors,
P, is increased from zero to 500. The figure is based on Amdahl's law (2.2) with
sequential percentages, f, of 5%, 10%, and 25%.
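
The bound (2.2) and its limit (2.3) are easy to tabulate. The following C sketch uses hypothetical function names and simply evaluates the right-hand side of (2.2); it is an illustration only, not the program used to produce Figure 2.2.

```c
/* Right-hand side of Amdahl's law (2.2): the largest speedup
   available on P processors when a fraction f of the operations
   must be performed sequentially. Requires P >= 1. */
double amdahl_bound(double f, double P)
{
    return 1.0 / (f + (1.0 - f) / P);
}

/* Limit (2.3) of the bound as P grows without bound.
   Requires f > 0. */
double amdahl_limit(double f)
{
    return 1.0 / f;
}
```

With f = 0.05 and P = 500 the bound is a little over 19, already close to the limiting value of 20. With f = 0.25 the bound stays below 4 no matter how many processors are added.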

We can see that Amdahl's law has some very discouraging news for so-called
massively parallel computing. The massive part of the term is loosely defined,
apparently meaning "many" processors. But Amdahl's law may be based upon a faulty
assumption [Ref. 20]. Consider the following reasoning about time. Let P be the
number of processors. Let s be the time required to execute the serial portions of
a program on a serial processor and let p be the amount of time required to
complete the parallel work on the same serial processor. Using this notation, and
normalizing (s + p = 1), Amdahl's law can be restated

$$S = \frac{s + p}{s + (p/P)} = \frac{1}{s + (p/P)}. \qquad (2.4)$$

Then, if we consider the case P = 1,024 with s < 10%, we see in Figure 2.3 that
speedup is severely restricted.


[Figure 2.2: plot of speedup (vertical axis, 0 to 20) against Number of Processors
(horizontal axis, 0 to 500), with curves for f = 0.05, f = 0.10, and f = 0.25.]
G.   SCALED  SPEEDUP 

These  problems  with  the  usual  notion  of  speedup  led  Gustafson,  Montry,  and 
Benner to question the validity of Amdahl's assumptions [Ref. 20: p. 3]:

The  expression  and  graph  are  based  on  the  implicit  assumption  that  p  is 
independent  of  P.  However,  one  does  not  generally  take  a  fixed  size  problem  and 
run  it  on  various  numbers  of  processors;  in  practice,  a  scientific  computing  problem 
scales  with  the  available  processing  power.  The  fixed  quantity  is  not  the  problem 
size  but  rather  the  amount  of  time  a  user  is  willing  to  wait  for  an  answer;  when 
given  more  computing  power,  the  user  expands  the  problem  (more  spatial  variables, 
for  example)  to  use  the  available  hardware  resources. 

As  a  first  approximation,  we  have  found  that  it  is  the  parallel  part  of  a  pro- 
gram that  scales  with  the  problem  size.  Times  for  program  loading,  serial  bottle- 
necks, and  I/O  that  make  up  the  s  component  of  the  application  do  not  scale  with 
the  problem  size.  When  we  double  the  number  of  processors,  we  double  the  number 
of  spatial  variables  in  a  physical  simulation.  As  a  first  approximation,  the  amount 
of work that can be done in parallel varies linearly with the number of processors.


Based upon this analysis, they present the notion of scaled speedup. They let s' and p' represent the serial and parallel time spent on a parallel system (the inverse of Amdahl's method), so that s' + p' = 1 and a uniprocessor requires time s' + p'P to perform the task. With these definitions, they define scaled speedup, S', to be

$$S' = \frac{s' + p'P}{s' + p'} = P + (1 - P)s'. \tag{2.5}$$

If we consider the same range of serial fractions as we did in Figure 2.3, we see that scaled speedup is much better than the usual speedup. Figure 2.4 shows the plot of scaled speedup.
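The contrast between the two notions of speedup can be seen with a few numbers. The sketch below evaluates the restated Amdahl formula and the scaled speedup formula for P = 1,024. The function names are hypothetical, and the two fractions are not the same quantity: s is measured on a serial processor, while s' is measured on the parallel system.

```python
def amdahl_speedup(s, P):
    """Fixed-size speedup 1 / (s + p/P) with s + p = 1 (s measured serially)."""
    return 1.0 / (s + (1.0 - s) / P)


def scaled_speedup(s_prime, P):
    """Scaled speedup P + (1 - P) s' with s' + p' = 1 (s' measured in parallel)."""
    return P + (1 - P) * s_prime


P = 1024
for frac in (0.01, 0.05, 0.10):
    print(f"fraction {frac:.2f}: fixed-size {amdahl_speedup(frac, P):8.2f}, "
          f"scaled {scaled_speedup(frac, P):8.2f}")
```

With a serial fraction of 10%, the fixed-size speedup stays below 10 while the scaled speedup remains above 900.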

Figure 2.4: Scaled Speedup

H.   SUMMARY

This chapter considers the background necessary to develop the algorithms (Chapters III and IV) and implement them (Chapter V). Algorithms are described as sequential plans first (Chapter III). The Gauss factorization algorithm is given in detail (Chapter III), including a discussion on the significance of pivoting. The method of conjugate gradients receives less attention, but a brief introduction is given near the end of Chapter III. The parallel considerations surveyed quickly in this chapter receive more attention in Chapter IV.

III.    THEORY

No  human  investigation  can  be  called  real  science  if  it  cannot  be  demonstrated 
mathematically. 

—  LEONARDO  DA  VINCI  (1452-1519) 

A.   SCOPE 

The  goal  of  this  research  is  to  demonstrate  a  parallel  method  for  solving  a 
system  of  linear  equations.  The  implementation  targets  two  contemporary  MIMD 
architectures:  the  Intel  iPSC/2  and  networks  of  INMOS  transputers.  There  are  many 
methods  for  solving  linear  systems.  This  work  concentrates  primarily  upon  Gauss 
factorization  (GF),  but  the  method  of  conjugate  gradients  (CG)  is  also  introduced. 
Regrettably,  CG  is  not  developed  due  to  time  constraints  (the  derivation  is  not 
trivial).  This  does  not  imply  that  Gauss  factorization  is  superior,  nor  that  it  possesses 
greater  potential  for  parallel  solution.  Indeed,  Hestenes  and  Stiefel  preferred  CG  to 
GF  for  a  number  of  very  good  reasons  [Ref.  18 :    p.  409]. 

As  we  shall  see,  the  utility  of  either  method  is  quite  dependent  upon  the  nature 
of  the  particular  problem.  Consider  the  system  of  linear  equations  represented  by 

Au  =  b.  (3.1) 

Much of the subsequent discussion applies to general, rectangular systems where $A \in \mathbb{R}^{m \times n}$. For the examples, however, square systems ($A \in \mathbb{R}^{n \times n}$) are used. This restriction greatly simplifies the discussion without losing much of the concept as it applies to general systems. The Gauss process, i.e., the main part of the work, excluding the stopping criteria and interpretation of the result, is the same in all three cases (m < n, m = n, and m > n).

To be sure, the three cases (m < n, m = n, and m > n) correspond to fundamentally different real-world systems, but the algorithms for each case are almost
identical.  The  restriction  to  a  square  system  will  greatly  simplify  the  discussion 
without  blinding  us  to  the  general,  rectangular  case.  The  extensions  to  the  general 
case  are  well  known.  Golub  and  Van  Loan  [Ref.  21 :  p.  102]  give  more  detail,  but  the 
square  case  is  most  expedient  for  now.  Square  systems  also  simplify  the  experimental 
procedure,  data  collection  and  analysis. 

The  Gauss  method  follows  naturally  from  a  hand  method  and  it  holds  strong 
appeal  to  intuition.  Without  a  pivoting  strategy,  however,  Gauss  can  attempt  division 
by  zero.  There  is  also  a  more  subtle  issue  of  rounding  errors  within  the  limits  of 
finite-precision  arithmetic.  To  forestall  errors  of  both  kinds,  partial  and  complete 
pivoting  strategies  are  used.  This  chapter  develops  the  (sequential)  algorithms  and 
explains  the  concept  of  pivoting.  This  is  a  sensible  starting  point  for  Chapter  IV, 
where  parallel  versions  of  the  algorithms  are  given. 

B.   APPROACH 

There  are  many  methods  that  may  be  applied  to  determine  the  solution  of  a 
system  of  linear  equations.  The  methods  were  designed  for  different  reasons  and 
with  different  problems  in  mind,  so  each  exhibits  a  unique  behavior.  One  method 
is  often  preferred  over  another  for  a  given  problem.  Ultimately,  the  criterion  is 
performance,  both  in  reliability  and  speed.  The  approach  described  here  and  in  the 
remaining  chapters  seeks  to  "maximize  performance"  while  retaining  a  reasonable 
balance  of  both  efficiency  and  quality.  Speed  and  numerical  accuracy  tend  to  oppose 
one  another  so  we  are  left  to  choose  from  several  options. 

A  hand  method  introduces  each  algorithm.  The  example  is  small  and  concrete. 
Solving  a  small  problem  gives  useful  insights  into  the  algorithms.  Once  the  hand 
method is established, it is expressed in an equivalent matrix notation. A high-level sequential algorithm is built upon this foundation. This algorithm shows how a machine, using a sequence of instructions, solves the problem. It also gives good estimates for the problem's time and storage complexities. The sequential-to-parallel
transition  involves  enough  issues  to  warrant  separate  coverage.  These  considerations 
appear  in  Chapter  IV. 

In  the  sections  that  follow,  Gaussian  elimination  is  presented  first.  It  reveals  the 
background  (sort  of  a  first  pass)  for  Gauss  factorization.  Once  the  reduction  process 
is  understood,  we  proceed  to  factorization.  A  description  of  the  method  of  conjugate 
gradients  is  given  at  the  end  of  the  chapter.  This  method,  due  to  Hestenes  and 
Stiefel,  is  based  upon  relatively  deep  theory.  Thus  the  derivations  and  background 
are  not  included.  Nevertheless,  a  synopsis  of  the  method  is  given. 

C.   APPLYING  THE  METHODS 

A  particular  method  is  often  tailored  to  a  specific  type  of  system.  The  method 
of  conjugate  gradients,  for  instance,  is  usually  used  when  the  matrix  of  coefficients, 
A,  is  symmetric  and  positive  definite  [Ref.  18:  p.  411].  The  Gauss  factorization 
algorithm  is  equally  important,  but  it  takes  quite  another  approach  to  solving  this 
system.  Both  CG  and  GF  lie  within  the  broad  category  of  methods  of  conjugate 
directions  (Chapter  II).  Indeed  both  work  in  just  about  any  case.  But,  the  better 
results  are  obtained  by  using  the  tool  that  fits  the  task  at  hand. 

A  very  rough  characterization  of  the  problem  can  simplify  algorithm  selection. 
We  will  look  for  two  qualities:  structure  and  density.  CG,  for  instance,  performs 
best  when  applied  to  highly  structured,  sparse  matrices  (i.e.,  matrices  with  many  zero 
entries).  Systems  like  the  sparse,  symmetric,  highly-structured  result  of  Appendix  D 
deserve  careful  solutions  that  do  not  destroy  the  existing  zeros.  Zeros  are  not  always 
easy to come by. Gaussian elimination must expend $2n^3/3$ flops to create them.




Selecting the wrong algorithm can lead to slower execution. More importantly, poor algorithm choice is a blunder (Chapter II). It can produce results that are accidentally perfect, grossly incorrect, or anywhere between. Therefore, no less than three tasks confront us:

•  Characterize  the  problem.    In  systems  like  (3.1),  attributes  of  the  matrix  of 
coefficients,  A,  may  provide  a  wealth  of  information. 

•  Understand  the  algorithm(s).  Know  the  types  of  problem(s)  it  is  designed  for 
(and,  more  importantly,  know  why). 

•  Create  or  select  an  algorithm  that  suits  the  problem. 

The  sparse,  highly-structured  problems  are  not  rare!  Anyone  who  has  observed 
nature  knows  that  many  natural  phenomena  exhibit  incredible  structure  and  sim- 
plicity. Strategies  for  solving  the  corresponding  system  should  always  seek  to  exploit 
these  characteristics.  Both  sparseness  and  structure  can  reduce  storage  requirements 
and  the  number  of  flops  required.  If  we  know  the  structure  in  advance,  there  may 
be  a  smart  way  to  avoid  some  calculations  entirely  or  minimize  the  work  involved. 
(Recall  Hestenes  and  Stiefel's  characterization  of  a  "good"  machine  method  from 
Chapter  II).  Other  problems,  when  translated  into  the  form  (3.1),  exhibit  a  dense 
matrix,  A,  with  little  or  no  apparent  structure. 

These  two  types  of  problems  should  not  be  handled  with  the  same  tools.  As 
with  many  computational  problems,  the  reasons  involve  the  use  of  time  and  space. 
We shall see that the Gauss algorithm has time complexity $O(n^3)$ and storage requirements $O(n^2)$. (Complexity notation appears in Appendix A.) Numbers like these grow rapidly with n and, regardless of how much memory is available, the problem can quickly overpower the computer. A naive approach to problems of these kinds can be expensive in terms of both storage and time. This is usually adequate incentive to take advantage of sparseness and structure whenever possible.
When  it  is  not  possible,  Gauss  is  a  good  choice. 
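To get a feel for how quickly these numbers grow, the following sketch tabulates the flop count 2n^3/3 + n^2 quoted in this chapter for Gaussian elimination with back substitution, together with the n^2 entries of dense storage. The figure of 8 bytes per entry is an assumption (double precision), and the function name `dense_gauss_cost` is hypothetical.

```python
def dense_gauss_cost(n, bytes_per_entry=8):
    """Return (flops, storage_bytes) for a dense n-by-n system.

    flops follows the 2n^3/3 + n^2 estimate; storage counts only the
    n^2 matrix entries, at an assumed bytes_per_entry each.
    """
    flops = 2 * n**3 / 3 + n**2
    storage_bytes = bytes_per_entry * n**2
    return flops, storage_bytes


for n in (100, 1000, 10000):
    flops, storage = dense_gauss_cost(n)
    print(f"n = {n:6d}: about {flops:.3e} flops, {storage / 2**20:10.1f} MiB")
```

Each tenfold increase in n multiplies the storage by one hundred and the work by roughly one thousand.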

D.   GAUSSIAN  ELIMINATION 

Suppose  that  we  want  to  solve  a  system  of  linear  equations  using  a  systematic, 
step-by-step  method.  We  assume  that  the  system  of  linear  equations  is  given,  and 
that  the  method  must  preserve  the  original  properties  of  the  system.  That  is,  the 
method  must  be  restricted  to  certain  operations;  namely: 

•  Multiply  an  equation  by  a  nonzero  constant. 

•  Interchange  equations. 

•  Add  a  multiple  of  one  equation  to  another. 

The fact that the first two operations do not change the system's properties is evident. The third operation is legitimate also (maybe not quite so obviously) and is, computationally, the most significant. Now let us apply some of these operations to a system of four equations in the four unknowns, $v_1$, $v_2$, $v_3$, and $v_4$.

$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
4v_1 + 6v_2 + 8v_3 + 5v_4 &= -5 \\
2v_1 + 4v_2 + 7v_3 + 9v_4 &= 13 \\
6v_1 + 8v_2 + 8v_3 + 9v_4 &= -17
\end{aligned} \tag{3.2}
$$

Let m (= 4) be the number of equations, and let n (= 4) be the number of unknowns in each equation. Additionally, let i be an equation (or row) index ($1 \le i \le m$) and let j indicate a subscript of v (column index) so that $1 \le j \le n$. Finally, let $\alpha_{ij}$ be the coefficient of $v_j$ in equation i (e.g., $\alpha_{12} = 3$). Suppose that the last equation contains only one nonzero coefficient (say $\alpha_{44}$) and the third equation has only two nonzero coefficients ($\alpha_{33}$ and $\alpha_{34}$) and so on. This defines a triangular system (Appendix A). The triangular system is our goal because it is easier to solve (by back substitution) than the current (square, dense) system.

Next, observe that a triangular system would result if we could eliminate every coefficient, $\alpha_{i1}$, of $v_1$ in all equations but the first (i > 1), the coefficients, $\alpha_{i2}$, of $v_2$ in the last two equations (i > 2), and the coefficient, $\alpha_{43}$, of $v_3$ in the final equation. To do this, we work by stages. At stage k, the coefficient, $\alpha_{kk}$, of $v_k$ in the kth equation is called the pivot. This term has little significance now but is clarified later (and it plays a very important role in the examples presented). In a particular stage, k, the goal is to operate upon all equations i where $i \in \{(k + 1), (k + 2), \ldots, m\}$ and eliminate all coefficients, $\alpha_{ik}$, of $v_k$.

1.    A  Hand  Method 

Before attempting to describe an algorithm for a machine solution, we consider an application of Gaussian elimination (GE) by hand. Initially, let k = 1. In the example system (3.2), the first (k = 1) pivot is the coefficient, $\alpha_{11} = 2$, of $v_1$ in the first equation. Notice that by subtracting twice the first equation from the second, a zero is produced under the pivot (eliminating $\alpha_{21}$). Similarly, by subtracting the first equation from the third, a zero appears as the leading coefficient in the third equation (eliminating $\alpha_{31}$). Finally, three times the first equation subtracted from the fourth equation eliminates the coefficient $\alpha_{41}$. Following these steps the altered system is:

$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-5v_4 &= -5 \\
v_2 + 3v_3 + 4v_4 &= 13 \\
-v_2 - 4v_3 - 6v_4 &= -17
\end{aligned} \tag{3.3}
$$

This is called the natural reduction process [Ref. 22: p. 72]. In this particular case, there are no changes on the right-hand side because the first equation's right-hand side is zero. This makes for trivial arithmetic on the right-hand side, but we should remember to perform the arithmetic upon whole equations (including the right-hand side) in general. The elimination is even more successful than planned. The second equation already has zeros where we ultimately wanted them in the fourth equation. That is, the system (3.3) would be closer to upper triangular if we were to alter it by interchanging equations 2 and 4.

$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-v_2 - 4v_3 - 6v_4 &= -17 \\
v_2 + 3v_3 + 4v_4 &= 13 \\
-5v_4 &= -5
\end{aligned} \tag{3.4}
$$


The  system  (3.4)  is  called  a  row  permutation  of  (3.3).  The  ability  to  recognize 
patterns  is  a  great  advantage  that  human  problem  solvers  enjoy.  Therefore,  taking 
advantage  of  our  capabilities  we  use  a  rather  subjective  "human"  pivoting  strategy. 
But  it  is  not  fitting  to  assume  that  an  efficient  algorithm  for  a  machine  would  involve 
the  same  sort  of  pattern  recognition. 

The system (3.4) is nearly triangular. The pivot moves to the second equation (k = 2), and we focus on the coefficient, $\alpha_{22} = -1$, of $v_k = v_2$. By adding the second equation to the third, the only nonzero coefficient remaining in the lower triangle ($\alpha_{32}$) is eliminated. The resulting system becomes

$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-v_2 - 4v_3 - 6v_4 &= -17 \\
-v_3 - 2v_4 &= -4 \\
-5v_4 &= -5
\end{aligned} \tag{3.5}
$$


The system is triangular, and it is easy to solve for the unknown values, $v_i$, by back substitution. By inspection, $v_4 = 1$. Substituting this value into the third equation, we find that $v_3 = 2$. Substituting both values ($v_4$ and $v_3$) into the second equation yields $v_2 = 3$. Finally, substituting the values $v_4$, $v_3$, and $v_2$ into the first equation gives $v_1 = -11$. The solution to the system is then

$$
u = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix}. \tag{3.6}
$$
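The back substitution just carried out by hand can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used later in this work; the function name `back_substitute` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that the small example is free of rounding error.

```python
from fractions import Fraction


def back_substitute(U, c):
    """Solve U x = c for an upper-triangular U with nonzero diagonal entries."""
    n = len(U)
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):               # last equation first
        known = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = Fraction(c[i] - known) / U[i][i]  # isolate the ith unknown
    return x


# Coefficients and right-hand side of the triangular system (3.5).
U = [[2, 3, 4, 5],
     [0, -1, -4, -6],
     [0, 0, -1, -2],
     [0, 0, 0, -5]]
c = [0, -17, -4, -5]
print([int(v) for v in back_substitute(U, c)])
```

Each unknown is found from one equation using only the unknowns already computed, which is why the triangular form is the goal of the reduction.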
2.    A  Machine  Method

The foregoing example illustrated the GE process as done on paper. The system was intentionally created for easy solution by hand calculation. That is, it uses integers, and elimination proceeds faster than in the usual case. Even this simple example requires a few minutes to determine u from the system (3.2) by hand. In Chapter VI, we see that a machine can perform this task in (much) less than a second. For this reason, it is worth examining an equivalent process to solve such a system by machine.

We  reenact  the  solution  from  the  beginning,  this  time  in  a  fashion  that 
a  sequence-controlled  machine  could  perform.  Until  now,  we  have  used  the  term 
"pivot"  but  have  found  no  practical  use  for  pivots.  In  this  example,  we  begin  to 
realize  the  utility  of  a  pivoting  strategy.  We  start  with  "no  pivoting"  and  shift  to 
the  "partial  pivoting"  strategy.  Additionally,  we  begin  to  use  a  more  compact  matrix 
notation.  Appendix  A  describes  the  notation  followed. 

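By the method described in Appendix A, we give the linear system (3.2) a matrix representation that corresponds to (3.1):

$$
Au = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ -5 \\ 13 \\ -17 \end{bmatrix}
= \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix}
= b. \tag{3.7}
$$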


First, we initialize a stage counter, k, so that k = 1. The pivot in stage k is $\alpha_{kk}$, on the diagonal of A ($\alpha_{11} = 2$). The immediate goal is to produce zeros beneath the pivot, in A(2:4, 1). A three-step process eliminates these coefficients in row order:

•  Divide.  Divide  every  element  beneath  the  pivot  by  the  pivot  value. 

•  Update.  Perform  arithmetic  in  the  Gauss  transform  area. 

•  Eliminate.  Set  the  elements  beneath  the  pivot  equal  to  zero.

The first step is a division. The denominator (pivot) is $\alpha_{kk} = \alpha_{11} = 2$, so $\alpha_{21}$ becomes the multiplier $(\alpha_{21}/2) = 2$. Similarly, let $\alpha_{31} = 1$ and let $\alpha_{41} = 3$. Now

$$
A = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 2 & 6 & 8 & 5 \\ 1 & 4 & 7 & 9 \\ 3 & 8 & 8 & 9 \end{bmatrix}. \tag{3.8}
$$

Next, consider everything below and to the right of the pivot. This is the Gauss transform area, G = A((k + 1):m, (k + 1):n) = A(2:4, 2:4). For each element in G, replace the current value, $\alpha_{ij}$, with $\alpha_{ij} - (\alpha_{ik})(\alpha_{kj})$. Do the same thing in the corresponding rows (i > k) of b, replacing $\beta_i$ with $\beta_i - (\alpha_{ik})(\beta_k)$. We will call this the process of performing arithmetic in (or updating) the Gauss transform area, G.

Finally, when the values beneath the pivot are no longer needed, eliminate them (set them equal to zero). The result is equivalent to the system (3.3):

$$
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 0 & 0 & -5 \\ 0 & 1 & 3 & 4 \\ 0 & -1 & -4 & -6 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ -5 \\ 13 \\ -17 \end{bmatrix} \tag{3.9}
$$


We have finished one stage of GE. We move into the next stage, k = 2. This time, when we try to update G we run into a very serious problem. The first step is to divide everything underneath the pivot by the pivot value $\alpha_{kk} = \alpha_{22} = 0$. This is the divide-by-zero problem of a "no pivoting" strategy.
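The divide, update, and eliminate steps of a single stage, and the failure just described, can be sketched as follows. The function name `ge_stage` is hypothetical and the stage index is 0-based in the code (stage k = 1 in the text is index 0). Exact rational arithmetic is an assumption made here so that the test for a zero pivot is reliable; it would not be in floating point.

```python
from fractions import Fraction


def ge_stage(A, b, k):
    """Apply one no-pivoting GE stage in place; k is the 0-based pivot index."""
    m, n = len(A), len(A[0])
    pivot = A[k][k]
    if pivot == 0:
        raise ZeroDivisionError(f"zero pivot in stage {k + 1}")
    for i in range(k + 1, m):
        A[i][k] = Fraction(A[i][k]) / pivot   # divide: form the multiplier
    for i in range(k + 1, m):
        for j in range(k + 1, n):
            A[i][j] -= A[i][k] * A[k][j]      # update the transform area G
        b[i] -= A[i][k] * b[k]                # same update on the right-hand side
    for i in range(k + 1, m):
        A[i][k] = 0                           # eliminate: multipliers are discarded


A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
b = [0, -5, 13, -17]
ge_stage(A, b, 0)                             # stage k = 1 of the text
print([[int(x) for x in row] for row in A], [int(x) for x in b])
try:
    ge_stage(A, b, 1)                         # stage k = 2: the pivot is zero
except ZeroDivisionError as err:
    print("no pivoting fails:", err)
```

After the first call the arrays hold the system (3.9); the second call stops at the zero pivot instead of dividing by it.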

During the execution of the hand example we simply moved the row to the bottom of the system to avoid this problem. Now, we could instruct the machine to test every element in A(k:m, k:n) and interchange rows so that those with the most leading zeros were placed at the bottom. This is problematic for several reasons. First, it is not dependable (testing for equality of floating-point numbers invites disaster). Secondly, even if we could identify zeros with confidence, it would add a sorting problem to GE! We are not looking for extra work. The solution is partial pivoting.

3.    Partial  Pivoting

Partial pivoting is an application of row interchanges to eliminate (primarily) the divide-by-zero problem. Consider the system of equations (3.1) with the nonsingular matrix of coefficients, $A \in \mathbb{R}^{m \times n}$ (i.e., m = n and the system has exactly one solution). Suppose further that storage and arithmetic are performed in infinite precision. (These assumptions, infinite precision and A nonsingular, are essential.)

Even  in  this  ideal  situation  Gauss  without  pivoting  is  dangerous  because, 
as  we  have  just  seen,  it  may  attempt  to  divide  by  zero.  Proper  row  permutations 
completely  eliminate  this  problem.  Partial  pivoting  will  guarantee  the  existence 
of  n  nonzero  pivots  for  A  nonsingular.  In  fact,  if  we  encounter  a  zero  pivot  with 
partial  pivoting,  it  means  that  A  is  singular  [Ref.  23].  The  remainder  of  this  section 
describes  the  partial  pivoting  strategy. 

Consider stage k of the GE process with $A \in \mathbb{R}^{m \times n}$. The goal is to pick the "best" row remaining (i.e., at or below the current pivot) and install it as row k, the pivot row. For reasons that are explained later, "best" shall mean the row whose kth (pivot column) element is largest in magnitude. Let s be the row index for the best pivot candidate. Initially, let s = k (i.e., $\alpha_{kk}$ is the first candidate). Next, we move down the pivot column, considering all $\alpha_{ik}$ where i > k.

To eliminate unnecessary assignments, we replace the current candidate with another only if $|\alpha_{ik}| > |\alpha_{sk}|$. When this occurs, we make sure that s is updated by setting it equal to i. After considering all elements, $\alpha_{ik}$, for $k \le i \le m$, s is the index of the "best possible" pivot row. To accomplish our goal, we must perform a row interchange. This is easy after the new pivot row has been determined. We simply swap rows k and s (if $k \ne s$). Within the assumptions above, we have completely
eliminated  the  potential  for  division  by  zero.  Now  let  us  return  to  the  problem  at 
hand. 
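The pivot search and row interchange described in this section can be sketched as follows. The helper names `find_partial_pivot` and `swap_rows` are hypothetical, indices are 0-based in the code, and the permutation vector q (holding the original 1-based equation numbers) is the bookkeeping device introduced in the next section. The data are those of the system (3.9).

```python
def find_partial_pivot(A, k):
    """Return the 0-based row s, with s >= k, whose column-k entry is largest in magnitude."""
    s = k                                   # the current pivot is the first candidate
    for i in range(k + 1, len(A)):
        if abs(A[i][k]) > abs(A[s][k]):     # strict '>' avoids needless reassignment
            s = i
    return s


def swap_rows(A, b, q, k, s):
    """Interchange equations k and s and record the interchange in q."""
    if k != s:
        A[k], A[s] = A[s], A[k]
        b[k], b[s] = b[s], b[k]
        q[k], q[s] = q[s], q[k]


A = [[2, 3, 4, 5], [0, 0, 0, -5], [0, 1, 3, 4], [0, -1, -4, -6]]
b = [0, -5, 13, -17]
q = [1, 2, 3, 4]
s = find_partial_pivot(A, 1)                # stage k = 2 of the text
swap_rows(A, b, q, 1, s)
print("pivot row (1-based):", s + 1, "q =", q)
```

Because the comparison is strict, ties keep the earlier candidate, so the entry 1 in the third row is chosen over the entry -1 in the fourth.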
4.    A  Machine  Method  (Resumed)


Applying partial pivoting to the system (3.9), we find that the next pivot is located at A(3,2), so we must interchange rows (equations) two and three. Before performing this step, however, let us create a vector to keep track of the row permutations. Let $q \in \mathbb{R}^m$ be the row permutation vector. We initialize q so that $\psi_i = i$:

$$
q = \begin{bmatrix} \psi_1 \\ \psi_2 \\ \psi_3 \\ \psi_4 \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \tag{3.10}
$$


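and perform row interchanges in q corresponding to those in A so that $\psi_i$ is always the original equation number for current equation number i. Thus, after performing the row interchange, we have

$$
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & 0 & -5 \\ 0 & -1 & -4 & -6 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ 13 \\ -5 \\ -17 \end{bmatrix},
\qquad
q = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix}. \tag{3.11}
$$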


Notice that $\psi_3 = 2$ indicates that the third equation in (3.11) was the second equation in the original system (3.7). Now, since $\alpha_{32} = 0$, no arithmetic is required in the third row. In row four, the arithmetic will be equivalent to the notion of adding (the current) equation two to equation four. The result is

$$
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & 0 & -5 \\ 0 & 0 & -1 & -2 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ 13 \\ -5 \\ -4 \end{bmatrix}. \tag{3.12}
$$

When we move the pivot index to the third equation (k = 3), we notice that $\alpha_{33} = 0$. The divide-by-zero problem has resurfaced. Once again, we pivot, swapping rows three and four. After this, we have

$$
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & -1 & -2 \\ 0 & 0 & 0 & -5 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 0 \\ 13 \\ -4 \\ -5 \end{bmatrix},
\qquad
q = \begin{bmatrix} 1 \\ 3 \\ 4 \\ 2 \end{bmatrix}. \tag{3.13}
$$

The zero beneath the final pivot obviates the need for further arithmetic. The triangular system (3.13), found by our machine method, does not look like the system (3.5) from the hand method because we did not perform the same row interchanges. If we had maintained a row permutation vector, q, for the hand method we would have noticed that

$$
\begin{bmatrix} 1 \\ 4 \\ 3 \\ 2 \end{bmatrix} = q. \tag{3.14}
$$

Of course, back substitution for the final (triangular) machine system (3.13) yields the same solution

$$
u = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix} \tag{3.15}
$$

as that of the hand method. Thus, even though we used different permutation schemes, the "pivots" in both cases were always nonzero and the solutions were the same. This is not surprising, since A is nonsingular and row permutation is merely the practice of interchanging equations.

Let us review first the process and then the theory of Gaussian elimination. The GE process performs a systematic elimination of the lower (in our example) triangle of a matrix of coefficients, A. Arithmetic operations are performed upon entire equations at the same time (including the right-hand side, b). In other words, during stage k of the process, arithmetic operations are performed upon (portions of) all rows i (i > k) of A and upon all elements (rows) $\beta_i$ (for i > k) of the right-hand side, b. The process depends upon both A and b and both of them can be changed substantially.

The idea behind Gaussian elimination is that general square systems are difficult to solve, but triangular systems are easy. The goal is to transform a general matrix A into triangular form, performing legitimate arithmetic upon entire equations (including the right-hand sides). Reduction to triangular form costs $2n^3/3$ flops. Once A is reduced to triangular form, back substitution yields a solution for the unknown, u, in $n^2$ flops. Thus GE solves a general, dense, square system of n equations in n unknowns by the application of $2n^3/3 + n^2$ flops [Ref. 21: pp. 88, 97].
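The whole process (reduction with partial pivoting, then back substitution) can be collected into one short routine. This is a sketch under stated assumptions, not the implementation developed in later chapters: the name `solve_ge_partial` is hypothetical, exact rational arithmetic stands in for the "infinite precision" assumption, and pivoting is applied from the first stage onward. The worked example did not pivot in its first stage, so the row permutation vector produced here differs from the one in the text, although the solution is the same.

```python
from fractions import Fraction


def solve_ge_partial(A, b):
    """Solve a square system A u = b by GE with partial pivoting; return (u, q)."""
    A = [[Fraction(x) for x in row] for row in A]    # work on copies
    b = [Fraction(x) for x in b]
    n = len(A)
    q = list(range(1, n + 1))                        # original equation numbers
    for k in range(n - 1):
        # max() returns the first largest entry, matching a strict '>' search.
        s = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[s][k] == 0:
            raise ValueError("zero pivot with partial pivoting: A is singular")
        if s != k:
            A[k], A[s] = A[s], A[k]
            b[k], b[s] = b[s], b[k]
            q[k], q[s] = q[s], q[k]
        for i in range(k + 1, n):
            mult = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= mult * A[k][j]            # whole equation, left side
            b[i] -= mult * b[k]                      # and right-hand side
    if A[n - 1][n - 1] == 0:
        raise ValueError("zero pivot with partial pivoting: A is singular")
    u = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):                   # back substitution
        known = sum(A[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (b[i] - known) / A[i][i]
    return u, q


A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
b = [0, -5, 13, -17]
u, q = solve_ge_partial(A, b)
print([int(v) for v in u], q)
```

The triple loop in the reduction is the source of the cubic term in the flop count, and the double loop in the back substitution is the source of the quadratic term.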

E.   GAUSS  FACTORIZATION 

Gauss  factorization  (GF)  is  a  well-known  method  for  solving  linear  systems 
like  (3.1)  that  (simultaneously)  factors  A.  GF  has  strong  ties  to  the  GE  process. 
Those  ties  will  become  evident  as  we  develop  the  same  example  over  again,  this  time 
using  the  GF  bookkeeping  and  method.  GF  holds  several  major  advantages  over  GE. 
Among  these:  A  is  recoverable  (the  process  does  not  destroy  it)  and  the  process  is 
independent  of  the  right-hand  side,  b.  In  fact,  b  is  not  used  in  the  factoring  process. 

1.    Complete  Pivoting 

The  complete  pivoting  strategy  will  be  applied  in  this  example.  There  is  no 
special  significance  behind  the  introduction  of  complete  pivoting  with  the  GF  process. 
Either strategy can be used with GE or GF. (A "no pivoting" strategy is also available, but it is not generally acceptable for serious problems.) The complete
strategy  is  a  straightforward  extension  of  the  partial  strategy,  so  introducing  partial 
pivoting  first  was  practical. 

With complete pivoting, row interchanges are still allowed, but so are column interchanges. We will continue to use $q \in \mathbb{R}^m$ for row interchange bookkeeping. The vector $p \in \mathbb{R}^n$, similarly, will maintain the column permutation information. We search not just the pivot column, but the entire Gauss transform area, for the next pivot. This takes longer but generally produces better solutions. The numerical differences between partial and complete pivoting involve some difficult error analysis. These issues will be addressed briefly after we complete the examples.

2.    Example

Now the GF process is demonstrated. We start with the same system of four equations in four unknowns:

$$
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
4v_1 + 6v_2 + 8v_3 + 5v_4 &= -5 \\
2v_1 + 4v_2 + 7v_3 + 9v_4 &= 13 \\
6v_1 + 8v_2 + 8v_3 + 9v_4 &= -17
\end{aligned} \tag{3.16}
$$

and proceed immediately to the matrix of coefficients (the factoring part of GF concerns itself with A only).

$$
A = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix} \tag{3.17}
$$


a.    Stage  Zero 


For the initial stage, k = 0, let the Gauss transform area be G = A. The sole purpose of stage zero is to find the first pivot. Initially, we guess that the pivot is $\alpha_{11}$, located at A(1,1), the upper left-hand corner of G. (This is the position where the new pivot will be installed.) Accordingly, we set row and column indices, s = 1 and t = 1, to keep track of the best pivot candidate.

Indices  s  and  t  are  changed  only  when  we  find  a  superior  candidate  for 
the  pivot.  To  begin  the  column-by-column  search  for  the  pivot  we  move  down  the 
columns  in  order  from  left  to  right  and  through  each  column  in  a  top-to-bottom 
manner.  When  we  have  considered  every  element  in  G,  we  know  that  the  next  pivot 
is  currently  situated  at  A(s,t). 

For the current example, as we move down the first column of G, the values of s and t are adjusted twice. A better pivot candidate is found, first at A(2,1), and next at A(4,1). The indices are adjusted again in the last row of column two, where the value, 8, is larger than the value of the current candidate, 6. Column three has no candidates larger than 8, so we do not adjust the indices again until we find the 9 at A(3,4). (The 9 at A(4,4) is not strictly larger, so the candidate stays at A(3,4).) Thus s = 3 and t = 4 have located the next pivot according to a complete pivoting strategy. This accomplishes the goal of stage zero. Now we specify the process for each of the remaining stages.
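The search of stage zero can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_pivot` and the list-of-rows storage are hypothetical, indices are 1-based to match the text, and the candidate moves only on a strictly larger magnitude.

```python
def find_pivot(A, k):
    """Return 1-based (s, t) locating the largest |A(i, j)| in G = A(k:m, k:n).

    The search runs column by column, top to bottom, and the candidate
    changes only when a strictly larger magnitude is found.
    """
    m, n = len(A), len(A[0])
    s, t = k, k                        # initial guess: upper-left corner of G
    for j in range(k, n + 1):          # columns, left to right
        for i in range(k, m + 1):      # rows, top to bottom
            if abs(A[i - 1][j - 1]) > abs(A[s - 1][t - 1]):
                s, t = i, j
    return s, t


A = [[2, 3, 4, 5],
     [4, 6, 8, 5],
     [2, 4, 7, 9],
     [6, 8, 8, 9]]

print(find_pivot(A, 1))   # (3, 4)
```

For the example matrix the candidate visits A(2,1), A(4,1), A(4,2) and finally A(3,4), as described above.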

b.  Outline  of  the  GF  Process 

For  each  stage,  k,  of  GF,  we  shall  perform  the  following  steps: 

•  Locate  the  pivot  according  to  a  pivoting  strategy  (none,  partial,  or  complete). 
If  complete  pivoting  is  used,  search  all  of  G  for  the  next  pivot. 

•  Increment  the  pivot  index,  k. 

•  Perform  any  row  and/or  column  permutations  that  are  required  to  move  the 
pivot  into  the  position  A(k,  k).  Update  p  and  q  accordingly. 

•  Divide  every  element  beneath  the  pivot  by  the  pivot  value. 

•  Redefine the Gauss transform area so that G = A((k + 1) : m, (k + 1) : n).

•  Perform  the  appropriate  arithmetic  in  G. 

Let  us  return  to  the  example  and  exercise  the  process. 

c.  Stage  One 

Since stage zero has already located the first pivot, the first step of section b is not necessary in this stage. We increment k (to k = 1) and install the pivot A(3,4) at A(k,k) = A(1,1). This means that rows 1 and 3 must be swapped. Columns 1 and 4 must be swapped in addition. The permutation vectors, p and q, record the interchanges.
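After interchanging rows and columns, we have

$$
A = \begin{bmatrix} 9 & 4 & 7 & 2 \\ 5 & 6 & 8 & 4 \\ 5 & 3 & 4 & 2 \\ 9 & 8 & 8 & 6 \end{bmatrix}, \quad
p = \begin{bmatrix} 4 \\ 2 \\ 3 \\ 1 \end{bmatrix}, \quad
q = \begin{bmatrix} 3 \\ 2 \\ 1 \\ 4 \end{bmatrix}. \tag{3.18}
$$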


Now we perform the division beneath the pivot, producing the multipliers in the lower three rows in the leftmost column of A. When this is done, we perform the arithmetic in G = A((k + 1) : m, (k + 1) : n) = A(2 : 4, 2 : 4). For GF, we do not replace the multipliers with zeros. We shall find that the multipliers are very useful in the end. The result is

$$
A = \begin{bmatrix} 9 & 4 & 7 & 2 \\ 5/9 & 34/9 & 37/9 & 26/9 \\ 5/9 & 7/9 & 1/9 & 8/9 \\ 1 & 4 & 1 & 4 \end{bmatrix}. \tag{3.19}
$$


Next,  with  G  being  the  lower  right  (3  x  3)  block  of  A,  we  search  G  for  the  next  pivot 
and  find  that  A(s,t)  =  A(2,  3)  holds  (37/9),  the  largest  second  pivot  candidate. 


d.    Stage  Two 

We increment the stage counter (k = 2), so that it points to the new pivot location, A(2,2). Since s = k, we know that no row interchange is necessary and q will not change. We must, however, swap columns k = 2 and t = 3. The result is:

$$
A = \begin{bmatrix} 9 & 7 & 4 & 2 \\ 5/9 & 37/9 & 34/9 & 26/9 \\ 5/9 & 1/9 & 7/9 & 8/9 \\ 1 & 1 & 4 & 4 \end{bmatrix}, \quad
p = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix}, \quad
q = \begin{bmatrix} 3 \\ 2 \\ 1 \\ 4 \end{bmatrix}. \tag{3.20}
$$


Once  again,  we  divide  everything  under  the  pivot  by  the  value  of  the  pivot  and 
update  G.  This  yields 


$$
A = \begin{bmatrix} 9 & 7 & 4 & 2 \\ 5/9 & 37/9 & 34/9 & 26/9 \\ 5/9 & 1/37 & 25/37 & 30/37 \\ 1 & 9/37 & 114/37 & 122/37 \end{bmatrix}. \tag{3.21}
$$
e.    Stage Three

Now G becomes the (2 x 2) lower right block of A and the next pivot (122/37) is found at A(s,t) = A(4,4). Since k = 3 we must interchange rows 3 and 4 as well as columns 3 and 4. The result of the permutation is

$$
A = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5/9 & 37/9 & 26/9 & 34/9 \\ 1 & 9/37 & 122/37 & 114/37 \\ 5/9 & 1/37 & 30/37 & 25/37 \end{bmatrix}, \quad
p = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 2 \end{bmatrix}, \quad
q = \begin{bmatrix} 3 \\ 2 \\ 4 \\ 1 \end{bmatrix}. \tag{3.22}
$$


Then,  dividing  at  the  bottom  of  the  pivot  column  and  updating  G,  we  have 


$$
A = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5/9 & 37/9 & 26/9 & 34/9 \\ 1 & 9/37 & 122/37 & 114/37 \\ 5/9 & 1/37 & 15/61 & -15/183 \end{bmatrix}. \tag{3.23}
$$


f.    Stage Four


The final stage, where k = 4 = min(m,n), is always trivial. We need only to verify that $\alpha_{44}$, the element now at A(4,4), is nonzero. This tells us that, indeed, A is nonsingular. There is no arithmetic to perform, so (3.23) is the final, factored, copy of A.
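The four stages can be replayed mechanically. The sketch below is a minimal Python illustration, not the thesis code: the function name `gf_complete` is hypothetical, exact rational arithmetic (`fractions.Fraction`) is an assumption made so that the entries can be compared with (3.23), and the loop index k is 0-based while p and q are reported 1-based.

```python
from fractions import Fraction


def gf_complete(rows):
    """Factor a copy of the matrix in place with complete pivoting.

    Returns the overwritten matrix plus the 1-based permutation
    vectors p (columns) and q (rows).
    """
    A = [[Fraction(x) for x in row] for row in rows]
    m, n = len(A), len(A[0])
    p = list(range(1, n + 1))
    q = list(range(1, m + 1))
    for k in range(min(m, n)):
        # Locate the pivot: column-by-column search of G = A[k:, k:].
        s, t = k, k
        for j in range(k, n):
            for i in range(k, m):
                if abs(A[i][j]) > abs(A[s][t]):
                    s, t = i, j
        if A[s][t] == 0:
            raise ValueError("no nonzero pivot available")
        # Install the pivot at A[k][k] and record the interchanges.
        A[k], A[s] = A[s], A[k]
        q[k], q[s] = q[s], q[k]
        for row in A:
            row[k], row[t] = row[t], row[k]
        p[k], p[t] = p[t], p[k]
        # Divide beneath the pivot, then update G.
        for i in range(k + 1, m):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A, p, q


A = [[2, 3, 4, 5],
     [4, 6, 8, 5],
     [2, 4, 7, 9],
     [6, 8, 8, 9]]

factored, p, q = gf_complete(A)
for row in factored:
    print([str(x) for x in row])
print(p, q)
```

Note that `Fraction` reduces -15/183 to -5/61, so the last entry prints in lowest terms.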

g.    Summary 

Using the Gauss factorization process we have systematically transformed the matrix $A \in \mathbb{R}^{4 \times 4}$ into a form that factors the original version of A. At this point the factorization itself has not been discussed, only the process whereby we claim to have factored A. Before we explore the resulting factorization, let us consider, in a general way, what happens in any stage, k, of GF.

3.    One  Stage  of  Gauss  Factorization 

The  most  important  part  of  GF  is  the  factorization  that  it  produces. 
The  GF  process  is  reversible  (pivots  and  other  key  information  become  part  of  the 
factorization). This section, using block matrix notation and induction on the stage number, illustrates the effect of one stage of GF. The proof shows that we can perform an n-step Gauss factorization A = LR, with L unit lower triangular and R right (upper) triangular with nonzero diagonal elements. Before the proof, however, let us consider a concrete illustration where n = 15.

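Let $\otimes$ denote those elements that Gauss has fixed in both value and position. The $\times$ symbol marks elements that are subject to permutations but not changes in value. Those elements that are subject to both permutation and changes in value are indicated by the $\odot$ symbol. Elements in the pivot row are marked with the $\ominus$ symbol and the symbol $\oslash$ denotes elements beneath the pivot. White space indicates zeros, $\alpha$ is the pivot, and any $\rho_i$ was a former pivot (in stage i). Let k = 7. Then the leftmost 7 columns of $R_7$ are already fixed in upper triangular form and $L_7$ is unit lower triangular with the special form described above. Upon entering stage (k + 1) = 8 of the Gauss factorization process, the matrices $L_7$ and $R_7$ would appear as shown below:

$$
L_7 = \left[\begin{array}{*{15}{c}}
1 \\
\otimes & 1 \\
\otimes & \otimes & 1 \\
\otimes & \otimes & \otimes & 1 \\
\otimes & \otimes & \otimes & \otimes & 1 \\
\otimes & \otimes & \otimes & \otimes & \otimes & 1 \\
\otimes & \otimes & \otimes & \otimes & \otimes & \otimes & 1 \\
\ominus & \ominus & \ominus & \ominus & \ominus & \ominus & \ominus & 1 \\
\times & \times & \times & \times & \times & \times & \times & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & & & & 1 \\
\times & \times & \times & \times & \times & \times & \times & & & & & & & & 1
\end{array}\right] \tag{3.24}
$$

$$
R_7 = \left[\begin{array}{*{15}{c}}
\rho_1 & \otimes & \otimes & \otimes & \otimes & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & \rho_2 & \otimes & \otimes & \otimes & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & \rho_3 & \otimes & \otimes & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & \rho_4 & \otimes & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & & \rho_5 & \otimes & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & & & \rho_6 & \otimes & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & & & & \rho_7 & \otimes & \times & \times & \times & \times & \times & \times & \times \\
 & & & & & & & \alpha & \ominus & \ominus & \ominus & \ominus & \ominus & \ominus & \ominus \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot \\
 & & & & & & & \oslash & \odot & \odot & \odot & \odot & \odot & \odot & \odot
\end{array}\right] \tag{3.25}
$$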


With  this  illustration  in  mind,  let  us  prove  the  effect  of  GF. 

Proposition: Given $A \in \mathbb{R}^{n \times n}$. Let $L_i \in \mathbb{R}^{n \times n}$ be the unit lower triangular matrix with $I_{n-i}$, the $(n-i) \times (n-i)$ identity, as its lower, right-hand block. Let $R_i \in \mathbb{R}^{n \times n}$ be the matrix that is upper (right) triangular in its leftmost i columns. Initially, let $A = L_0 R_0$ with $L_0 = I$ and $R_0 = A$. Let P(k) be the proposition: "Stage k of the Gauss factorization process yields the factorization, $A = L_k R_k$."

To Show: $P(k) \Rightarrow P(k+1)$ for $0 \le k \le (n-1)$.

Assumptions: Pivoting, according to any valid strategy, is performed outside of this factorization procedure and the pivoting strategy yields pivots, $\alpha \neq 0$.




Notation: We can partition A so that

$$
A = \begin{bmatrix} \alpha & y^T \\ x & G \end{bmatrix} \tag{3.26}
$$

where $\alpha \in \mathbb{R}$ is the initial pivot, $x \in \mathbb{R}^{n-1}$ holds the values beneath the pivot, $y \in \mathbb{R}^{n-1}$ holds the values of the elements in the pivot row to the right of the pivot, and $G \in \mathbb{R}^{(n-1) \times (n-1)}$ is the Gauss transform area.

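Basis for Induction: We must show that $P(0) \Rightarrow P(1)$. P(0) means that $L_0 = I_n$ and $R_0 = A$. That is, $R_0$ has no special structure except (by assumption) we are guaranteed a nonzero pivot $\alpha$. Consider stage k = 1 of Gauss factorization. Let us partition A as above and factor

$$
A = \begin{bmatrix} \alpha & y^T \\ x & G \end{bmatrix}
  = \begin{bmatrix} 1 & 0^T \\ \ell & I \end{bmatrix}
    \begin{bmatrix} \rho & r^T \\ 0 & B \end{bmatrix}
  = L_1 R_1 \tag{3.27}
$$

where B, $\ell$, r, and $\rho$ (with the obvious sizes) are defined as

$$\rho = \alpha \tag{3.28}$$

$$r = y \tag{3.29}$$

$$\ell = \frac{x}{\alpha} \tag{3.30}$$

$$B = G - \ell r^T \tag{3.31}$$

Thus, given $A = L_0 R_0$, Gauss factors $A = L_1 R_1$ and $P(0) \Rightarrow P(1)$.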

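Inductive Step: Consider the matrices $L_k$ and $R_k$ that are submitted to stage (k + 1) of a Gauss factorization procedure. We make the inductive step to show that $P(k) \Rightarrow P(k+1)$. For $0 < k < n$, $A = L_k R_k$ may be partitioned so that

$$
A = \begin{bmatrix} L & 0 & 0 \\ m^T & 1 & 0 \\ N & 0 & I \end{bmatrix}
    \begin{bmatrix} R & s & T \\ 0^T & \alpha & y^T \\ 0 & x & G \end{bmatrix}
  = L_k R_k \tag{3.32}
$$

where $L \in \mathbb{R}^{k \times k}$ is a unit lower triangular matrix and $R \in \mathbb{R}^{k \times k}$ is a right (upper) triangular matrix with nonzero diagonal elements.

The Gauss process forms $\rho$ as in (3.28), r as in (3.29), multipliers, $\ell$, as in (3.30), and B as in (3.31). Then, for $0 < k \le (n-1)$, GF forms

$$
A = \begin{bmatrix} L & 0 & 0 \\ m^T & 1 & 0 \\ N & \ell & I \end{bmatrix}
    \begin{bmatrix} R & s & T \\ 0^T & \rho & r^T \\ 0 & 0 & B \end{bmatrix}
  = L_{k+1} R_{k+1} \tag{3.33}
$$

Multiplying out the blocks of (3.33) reproduces (3.32), because $\ell \rho = x$ and $\ell r^T + B = G$. Thus, for $0 \le k < n$, $P(k) \Rightarrow P(k+1)$. [Ref. 24]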

Conclusion: The nonsingular matrix $A \in \mathbb{R}^{n \times n}$ can be factored, in n steps of the Gauss factorization process, so that A = LR with L being unit lower triangular and R being upper triangular with nonzero diagonal elements.
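The induction can be watched numerically. The following Python sketch is an illustration only: the helper names `matmul` and `gauss_stage` and the 3 x 3 test matrix are assumptions, not part of the source, and no pivoting is performed (the chosen matrix happens to have nonzero pivots at every stage). After each stage it checks that the product of the current factors still equals A.

```python
from fractions import Fraction


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def gauss_stage(L, R, k):
    """Advance (L_k, R_k) to (L_{k+1}, R_{k+1}) in place; k counts from 0."""
    n = len(R)
    alpha = R[k][k]                      # rho = alpha, assumed nonzero
    for i in range(k + 1, n):
        ell = R[i][k] / alpha            # multiplier: ell = x / alpha
        L[i][k] = ell
        R[i][k] = Fraction(0)
        for j in range(k + 1, n):
            R[i][j] -= ell * R[k][j]     # B = G - ell r^T


A = [[2, 1, 1],
     [4, 3, 3],
     [8, 7, 9]]
n = len(A)
L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]   # L_0 = I
R = [[Fraction(x) for x in row] for row in A]                       # R_0 = A

for k in range(n):
    gauss_stage(L, R, k)
    assert matmul(L, R) == A             # P(k+1): A = L_{k+1} R_{k+1}

print(L)
print(R)
```

The last stage performs no arithmetic, which mirrors the trivial final stage of the worked example.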

The proof has demonstrated the effect of GF. For simplicity, it excluded the pivoting strategy (simply assuming that, at every stage, a pivot $\alpha \neq 0$ would be available). It also held A square. In this sense the proof is somewhat specific. There is a more general conclusion to be made. This conclusion holds for GF with pivoting and $0 \neq A \in \mathbb{R}^{m \times n}$ and it is absolutely essential to understanding the factorization.

4.   The  LR  Theorem 

With  the  GF  process  complete,  and  the  vast  majority  of  the  work  done, 
we  show  how  to  form  a  solution  from  our  factorization.  Various  methods  of  pivoting 
(resulting  in  permutation  vectors)  and  the  method  whereby  A  is  factored  have  been 
discussed.  To  solve  the  system,  we  must  put  all  of  this  information  together.  The 
key  is  the  LR  Theorem  [Ref.  24]: 

Theorem 3.1 (LR Theorem) Let $0 \neq A \in \mathbb{R}^{m \times n}$. Then there are permutation matrices $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{m \times m}$, an integer $r \ge 1$, a lower trapezoidal matrix $L \in \mathbb{R}^{m \times r}$ and an upper (right) trapezoidal matrix $R \in \mathbb{R}^{r \times n}$ so that $Q^T A P = LR$. The diagonal elements of L satisfy $\lambda_{ii} = 1$ with $i = 1, 2, \ldots, r$ and the diagonal elements of R satisfy $\rho_{ii} \neq 0$ for $i = 1, 2, \ldots, r$.




5.    Filling  in  the  Blanks 

a.    The  Main  Factors 

GF used the space of A to hold the two principal matrices, L and R, in the factorization of A. To see them, we will extract the lower triangular matrix, L, and upper (right) triangular matrix, R, from the final copy of A (3.23). Initially, let L = R = 0. We form L by placing ones on its diagonal and filling the elements below the diagonal from the corresponding locations in A.

$$
L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 5/9 & 1 & 0 & 0 \\ 1 & 9/37 & 1 & 0 \\ 5/9 & 1/37 & 15/61 & 1 \end{bmatrix} \tag{3.34}
$$

R is formed with the diagonal elements (i.e., pivots) and upper triangle of A.

$$
R = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 0 & 37/9 & 26/9 & 34/9 \\ 0 & 0 & 122/37 & 114/37 \\ 0 & 0 & 0 & -15/183 \end{bmatrix} \tag{3.35}
$$


b.    Permutation  Matrices 


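The bookkeeping allows us to construct P and Q very quickly. To form $P \in \mathbb{R}^{n \times n}$, we set every column, j, in P equal to the axis vector implied by $\pi_j$, the jth element of p. This yields the permutation matrix, P, that will satisfy the LR Theorem, namely

$$
p = \begin{bmatrix} \pi_1 \\ \pi_2 \\ \pi_3 \\ \pi_4 \end{bmatrix} = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 2 \end{bmatrix}, \quad
P = \begin{bmatrix} e_4 & e_3 & e_1 & e_2 \end{bmatrix}
  = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}. \tag{3.36}
$$

Similarly, every column, j, in $Q \in \mathbb{R}^{m \times m}$ is set equal to the axis vector implied by $\psi_j$, the jth element of q. For our example, we have

$$
q = \begin{bmatrix} \psi_1 \\ \psi_2 \\ \psi_3 \\ \psi_4 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 4 \\ 1 \end{bmatrix}, \quad
Q = \begin{bmatrix} e_3 & e_2 & e_4 & e_1 \end{bmatrix}
  = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}. \tag{3.37}
$$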
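The construction rule is short enough to state as code. The Python sketch below is an illustration; the function name `permutation_matrix` is hypothetical and the vector entries are taken to be 1-based, as in p and q above.

```python
def permutation_matrix(v):
    """Build the matrix whose column j is the axis vector e_{v[j]}.

    The entries of v are 1-based, matching the permutation vectors
    p and q of the text.
    """
    n = len(v)
    M = [[0] * n for _ in range(n)]
    for j, target in enumerate(v):
        M[target - 1][j] = 1
    return M


P = permutation_matrix([4, 3, 1, 2])
Q = permutation_matrix([3, 2, 4, 1])

for row in P:
    print(row)
print()
for row in Q:
    print(row)
```

In practice the dense matrices are rarely needed; applying p and q as index vectors has the same effect and costs far less storage.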
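c.    Check

Now we check to make sure that our solution satisfies the LR Theorem. First, consider the product LR:

$$
LR = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 5/9 & 1 & 0 & 0 \\ 1 & 9/37 & 1 & 0 \\ 5/9 & 1/37 & 15/61 & 1 \end{bmatrix}
     \begin{bmatrix} 9 & 7 & 2 & 4 \\ 0 & 37/9 & 26/9 & 34/9 \\ 0 & 0 & 122/37 & 114/37 \\ 0 & 0 & 0 & -15/183 \end{bmatrix} \tag{3.38}
$$

$$
= \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5 & 8 & 4 & 6 \\ 9 & 8 & 6 & 8 \\ 5 & 4 & 2 & 3 \end{bmatrix}. \tag{3.39}
$$

And

$$
Q^T A P = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}
          \begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix}
          \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \tag{3.40}
$$

$$
= (Q^T A) P = \begin{bmatrix} 2 & 4 & 7 & 9 \\ 4 & 6 & 8 & 5 \\ 6 & 8 & 8 & 9 \\ 2 & 3 & 4 & 5 \end{bmatrix}
              \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \tag{3.41}
$$

$$
= \begin{bmatrix} 9 & 7 & 2 & 4 \\ 5 & 8 & 4 & 6 \\ 9 & 8 & 6 & 8 \\ 5 & 4 & 2 & 3 \end{bmatrix}. \tag{3.42}
$$

Our factorization satisfies $Q^T A P = LR$.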
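The same check takes only a few lines of Python when exact fractions are used. In this sketch (an illustration, with the alias `F` and the variable names chosen here for convenience) the product $Q^T A P$ is formed by indexing: row i of $Q^T A$ is row $q_i$ of A, and column j of the result is column $p_j$.

```python
from fractions import Fraction as F

A = [[2, 3, 4, 5],
     [4, 6, 8, 5],
     [2, 4, 7, 9],
     [6, 8, 8, 9]]

L = [[1, 0, 0, 0],
     [F(5, 9), 1, 0, 0],
     [1, F(9, 37), 1, 0],
     [F(5, 9), F(1, 37), F(15, 61), 1]]

R = [[9, 7, 2, 4],
     [0, F(37, 9), F(26, 9), F(34, 9)],
     [0, 0, F(122, 37), F(114, 37)],
     [0, 0, 0, F(-15, 183)]]

p = [4, 3, 1, 2]   # column permutation vector (1-based)
q = [3, 2, 4, 1]   # row permutation vector (1-based)

# Product of the two triangular factors.
LR = [[sum(L[i][k] * R[k][j] for k in range(4)) for j in range(4)]
      for i in range(4)]

# Q^T A P by gathering: entry (i, j) is A(q_i, p_j).
QtAP = [[A[q[i] - 1][p[j] - 1] for j in range(4)] for i in range(4)]

print(LR == QtAP)   # True
```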


d.    Solution 




Now we solve the system. Recall that Gaussian elimination operated on the matrix, A, and the right-hand side, b, at the same time. The end result of GE is that A is reduced to upper triangular form by successive elimination of the lower triangle so that we could solve for u with a relatively easy back substitution.

The strategy of Gauss factorization is different. First, b is not part of the factorization process. Secondly, even though we are changing A, we know that we can get it back at the end (if we want to), so there is no need to save the original A. Now, using the LR Theorem, we complete the solution. Recall that the original system was

$$Au = b. \tag{3.43}$$

The factorization process constructs permutation matrices P and Q and transforms the original matrix A into a combined version of L and R. Further (by the LR Theorem) we know that these matrices satisfy

$$Q^T A P = LR. \tag{3.44}$$

Now, by multiplying (3.44) through by Q from the left and $P^T$ on the right, we see that

$$Q Q^T A P P^T = Q L R P^T. \tag{3.45}$$

Performing the cancellations on the left-hand side (permutation matrices are orthogonal, so $QQ^T = I$ and $PP^T = I$), we have

$$A = Q L R P^T. \tag{3.46}$$

This is the factorization of A. Substituting this into (3.43) yields

$$Q L R P^T u = b \tag{3.47}$$

or

$$L R P^T u = Q^T b. \tag{3.48}$$

Now let $\tilde{b} = Q^T b$ and let $\tilde{u} = P^T u$. Then

$$L R \tilde{u} = \tilde{b}. \tag{3.49}$$

Further, let $R\tilde{u} = c$ for some unknown vector, c. We have

$$L c = \tilde{b}. \tag{3.50}$$

Since we know L and $\tilde{b}$, we may solve for c by a simple forward substitution. Then, using c and knowing that $R\tilde{u} = c$, we perform a simple back substitution and determine $\tilde{u}$. Finally, by definition, $\tilde{u} = P^T u$ (i.e., $\tilde{u}$ is a mere permutation of u) so we can swap elements in $\tilde{u}$ to arrive at u using $P\tilde{u} = u$.

Let us summarize this lengthy process into the main steps. The GF process factors $A = QLRP^T$, changing the general matrix into a product where the most significant factors are both triangular. This reduces the hard problem to two easy ones. It is designed so that we can solve for u in two steps:

•  Solve, by forward substitution, the system $Lc = \tilde{b}$ for a vector, c, of unknowns.

•  Solve, by back substitution, the system $R\tilde{u} = c$ for (a permutation of) the original unknowns, u.
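The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `forward_sub` and `back_sub` are hypothetical, exact fractions are used so the intermediate values are easy to compare by hand, and the permutations are applied through the vectors p and q rather than through dense matrices.

```python
from fractions import Fraction as F


def forward_sub(L, b):
    """Solve L c = b where L is unit lower triangular."""
    c = []
    for i in range(len(b)):
        c.append(b[i] - sum(L[i][j] * c[j] for j in range(i)))
    return c


def back_sub(R, c):
    """Solve R u = c where R is upper triangular with nonzero diagonal."""
    n = len(c)
    u = [F(0)] * n
    for i in reversed(range(n)):
        tail = sum(R[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (c[i] - tail) / R[i][i]
    return u


L = [[1, 0, 0, 0],
     [F(5, 9), 1, 0, 0],
     [1, F(9, 37), 1, 0],
     [F(5, 9), F(1, 37), F(15, 61), 1]]
R = [[9, 7, 2, 4],
     [0, F(37, 9), F(26, 9), F(34, 9)],
     [0, 0, F(122, 37), F(114, 37)],
     [0, 0, 0, F(-15, 183)]]
p = [4, 3, 1, 2]
q = [3, 2, 4, 1]
b = [0, -5, 13, -17]

b_tilde = [b[qi - 1] for qi in q]        # Q^T b
c = forward_sub(L, b_tilde)              # L c = Q^T b
u_tilde = back_sub(R, c)                 # R u~ = c
u = [None] * 4
for j, pj in enumerate(p):               # u = P u~
    u[pj - 1] = u_tilde[j]

print([str(x) for x in c])
print([str(x) for x in u])
```

Each triangular solve costs on the order of $n^2/2$ multiplications, which is why the factorization, not the substitution, dominates the work.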

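So, for our example, the first step is to solve

$$
Lc = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 5/9 & 1 & 0 & 0 \\ 1 & 9/37 & 1 & 0 \\ 5/9 & 1/37 & 15/61 & 1 \end{bmatrix}
     \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}
   = Q^T b = \begin{bmatrix} 13 \\ -5 \\ -17 \\ 0 \end{bmatrix}
   = \begin{bmatrix} \tilde{\beta}_1 \\ \tilde{\beta}_2 \\ \tilde{\beta}_3 \\ \tilde{\beta}_4 \end{bmatrix}
   = \tilde{b}. \tag{3.51}
$$

Forward substitution, applied to this system, yields

$$
c = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}
  = \begin{bmatrix} 13 \\ -110/9 \\ -1000/37 \\ -15/61 \end{bmatrix}. \tag{3.52}
$$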
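Now we know c, so we can solve the second triangular system, $R\tilde{u} = c$, for $\tilde{u}$ by back substitution

$$
R\tilde{u} = \begin{bmatrix} 9 & 7 & 2 & 4 \\ 0 & 37/9 & 26/9 & 34/9 \\ 0 & 0 & 122/37 & 114/37 \\ 0 & 0 & 0 & -15/183 \end{bmatrix}
             \begin{bmatrix} \tilde{v}_1 \\ \tilde{v}_2 \\ \tilde{v}_3 \\ \tilde{v}_4 \end{bmatrix}
           = \begin{bmatrix} 13 \\ -110/9 \\ -1000/37 \\ -15/61 \end{bmatrix} = c \tag{3.53}
$$

which yields

$$
\tilde{u} = \begin{bmatrix} \tilde{v}_1 \\ \tilde{v}_2 \\ \tilde{v}_3 \\ \tilde{v}_4 \end{bmatrix}
          = \begin{bmatrix} 1 \\ 2 \\ -11 \\ 3 \end{bmatrix}. \tag{3.54}
$$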


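Now it is easy to recover u. Since we have defined $\tilde{u} = P^T u$, we know that $P\tilde{u} = u$ (a simple rearrangement of the elements that we have already found). We apply P to $\tilde{u}$ and find that

$$
P\tilde{u} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}
             \begin{bmatrix} \tilde{v}_1 \\ \tilde{v}_2 \\ \tilde{v}_3 \\ \tilde{v}_4 \end{bmatrix}
           = \begin{bmatrix} \tilde{v}_3 \\ \tilde{v}_4 \\ \tilde{v}_2 \\ \tilde{v}_1 \end{bmatrix}
           = \begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix} = u. \tag{3.55}
$$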


Comparing this to earlier solutions, we find that GF has arrived at the same solution.

In these examples, the notion of elimination was developed first. The GE process performs successive eliminations beneath its pivots and reduces A to triangular form, and then the solution is available in only $n^2$ flops. GF spends an almost identical amount of work in the reduction process, but the result is a factorization with L and R being the significant factors. (They are the only ones that are more than a permutation of the identity.) In the examples, we used pivoting because it was practical. Now let us take a closer look at the justifications for pivoting.
F.   PIVOTING FOR SIZE

The  issue  of  pivoting  is  a  very  interesting  and  important  one.  We  concluded  that 
we  must  pivot  or  face  the  possibility  of  attempting  to  divide  by  zero,  an  unacceptable 
option.  To  solve  this  problem,  we  may  pick  any  nonzero  element  in  A(k  :m,k:n) 
and  perform  the  column  and  row  interchanges  required  to  install  it  as  the  new  pivot 
(k  is  the  pivot  index).  There  are  many  strategies  that  we  could  adopt. 

The  logical  question  would  be  something  like:  "Given  that  we  must  pivot,  what 
is  the  best  means  available?"  But  the  answer  is  not  so  easy,  and  there  are  many 
trade-offs  to  be  considered.  We  are  faced  with  choosing  along  a  spectrum,  where 
speed  lies  at  one  end  and  accuracy  lies  at  the  other.  For  instance,  we  could  begin  a 
search  and  pick  the  first  nonzero  element  in  this  area.  Or,  we  could  search  for  the 
row with the most nonzero elements (that had a nonzero element in the kth column).

The two most common strategies for pivoting are the partial and complete methods, which we have discussed. We determined that partial pivoting would work perfectly (with no error) if A was nonsingular and the storage and arithmetic could be handled with infinite precision. If infinite precision were available, we could stop right here. There would be no need to try to refine the method. In a finite-precision machine, however, we must deal with the issue of errors.

To deal with errors, the problem must be stated more precisely. The errors that concern us would arise due to growth of the elements of L and/or R as we step through the stages of Gauss. In the end, partial pivoting guarantees that all of the elements of L will be, at most, unity in absolute value. This is easy to see. The pivoting strategy chooses each pivot to be the largest element (in absolute value) in column k at or below row k. This value is installed at A(k,k) and everything below the pivot is divided by the pivot, so no multiplier can exceed one in magnitude.
Unfortunately, partial pivoting cannot make the same guarantee for the elements of R. It helps: the multipliers are less than or equal to one in absolute value. The elements of R are bounded by $2^{n-1}a$, where a is the largest absolute value of the elements in A. This bound is not normally attained "in practice". [Ref. 23]

Growth is an indicator of trouble in this process. If we cannot control it completely, we should, at a minimum, monitor it. The growth factor, g(n), of a Gauss factorization process for $A \in \mathbb{R}^{n \times n}$ is defined as follows. Let a be the largest absolute value in the original matrix, A. Let b be the largest absolute value that occurs in any Gauss transform, G, including the first one, G = A. Then g(n) = b/a gives a growth factor normalized by a (i.e., $g(n) \ge 1$).
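The definition of g(n) translates directly into code. The Python sketch below is an illustration, not part of the source: the function names are hypothetical, and the test matrix (ones on the diagonal, minus ones below it, ones in the last column) is the classical worst case usually attributed to Wilkinson, for which partial pivoting attains the bound $2^{n-1}$ exactly.

```python
from fractions import Fraction


def growth_factor_partial(rows):
    """Return g(n) = b/a for Gauss factorization with partial pivoting."""
    A = [[Fraction(x) for x in row] for row in rows]
    n = len(A)
    a = max(abs(x) for row in A for x in row)   # largest |entry| of A
    b = a                                       # largest |entry| seen in any G
    for k in range(n - 1):
        # First row at or below k with the largest magnitude in column k.
        s = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[s] = A[s], A[k]
        if A[k][k] == 0:
            continue                            # nothing to eliminate here
        for i in range(k + 1, n):
            mult = A[i][k] / A[k][k]
            A[i][k] = mult
            for j in range(k + 1, n):
                A[i][j] -= mult * A[k][j]
                b = max(b, abs(A[i][j]))
    return b / a


def worst_case_matrix(n):
    """1 on the diagonal, -1 below it, 1 in the last column, 0 elsewhere."""
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            W[i][j] = -1
        W[i][i] = 1
        W[i][n - 1] = 1
    return W


print(growth_factor_partial(worst_case_matrix(5)))   # 16
```

Every multiplier in this example is -1, so the last column doubles at each stage even though no multiplier exceeds one in magnitude.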

A great deal of analysis has been done on this subject. Wilkinson showed that, with complete pivoting and real matrices, g(n) grows much more slowly than $2^n$. He conjectured that $g(n) \le n$. The latter has recently been disproved, with a counterexample by Nicholas Young. [Ref. 23]

As  a  practical  matter,  when  one  seeks  to  monitor  growth  one  uses  complete 
pivoting.  To  consider  performance,  one  uses  the  partial  pivoting  strategy.  The 
growth  factor,  g(n),  is  easy  to  monitor  with  a  complete  pivoting  strategy  since  we  are 
moving  through  the  entire  Gauss  transform  area  at  each  stage  anyway.  For  clarity, 
the  pivoting  algorithms  and  the  Update  algorithm  are  listed  separately  in  this 
chapter.  In  real  code  (e.g.,  Appendix  F),  however,  the  pivot  for  stage  (k+  1)  should 
be  located  during  the  update  of  G  in  stage  k  (to  avoid  unnecessary  passes  through 
the  matrix).  This  would  mean  extra  work  in  the  partial  pivoting  algorithm.  Since 
the  primary  reason  for  using  partial  pivoting  is  performance,  it  is  counterproductive 
to  monitor  g(n)  while  using  partial  pivoting.  A  description  of  both  pivoting  policies, 
in  algorithm  form,  follows. 
Algorithm 3.1 (Partial Column Pivoting for Size) Given the matrix of coefficients, $A \in \mathbb{R}^{m \times n}$; a permutation vector, $q \in \mathbb{R}^m$; and an index, k, indicating the pivot column, this algorithm performs partial pivoting. First, the pivot element is located at A(s,k) with s ≥ k. Once the pivot has been located, rows s and k are swapped to install the new pivot. Additionally, elements in q, indexed by s and k, are swapped to record the row interchanges.


begin PP
    s = k;
    for i = (k + 1) : m
        if (|A(i,k)| > |A(s,k)|)
            s = i;
        end if
    end for
    if (s ≠ k)
        for j = 1 : n
            x = A(k,j);
            A(k,j) = A(s,j);
            A(s,j) = x;
        end for
        i = q(k);
        q(k) = q(s);
        q(s) = i;
    end if
end PP
Algorithm 3.2 (Complete Pivoting for Size) Given the matrix of coefficients, $A \in \mathbb{R}^{m \times n}$; permutation vectors, $p \in \mathbb{R}^n$ and $q \in \mathbb{R}^m$; and an index, k, indicating the pivot row and column, this algorithm performs complete pivoting. First, the pivot element is located at A(s,t). Once the pivot has been located, rows s and k and columns t and k are swapped to install the new pivot. The permutation vectors are updated accordingly.


begin PC
    s = k;
    t = k;
    for i = k : m                                   (locate the pivot)
        for j = k : n
            if (|A(i,j)| > |A(s,t)|)
                s = i;
                t = j;
            end if
        end for
    end for
    if (s ≠ k)                                      (row interchanges)
        for j = 1 : n
            x = A(k,j);       A(k,j) = A(s,j);       A(s,j) = x;
        end for
        i = q(k);       q(k) = q(s);       q(s) = i;
    end if
    if (t ≠ k)                                      (column interchanges)
        for i = 1 : m
            x = A(i,k);       A(i,k) = A(i,t);       A(i,t) = x;
        end for
        i = p(k);       p(k) = p(t);       p(t) = i;
    end if
end PC
G.   SEQUENTIAL ALGORITHMS

The  examples  considered  have  described  the  Gauss  process.  We  first  considered 
elimination  (GE)  and  then  a  factorization  method  (GF).  Both  methods  require  work 
of  the  same  order,  so  the  latter,  yielding  a  factorization  of  A  is  much  preferred. 
Algorithms  for  the  GF  process  are  described  below.  The  arithmetic  in  the  Gauss 
transform  area,  G,  is  performed  the  same  (regardless  of  pivoting  strategy)  so  a 
separate  algorithm  is  given  for  updating  G.  The  algorithms  GFPP  (pivoting,  partial) 
and  GFPC  (pivoting,  complete)  are  given  following  the  updating  algorithm.  These 
algorithms  are  adapted  from  Gragg  [Ref.  23]. 


Algorithm 3.3 (Update Gauss Transform Area) Given the matrix of coefficients, $A \in \mathbb{R}^{m \times n}$; and k, the pivot column, this algorithm performs the appropriate arithmetic throughout the pivot column and Gauss transform area, G, of A.


begin Update
    x = A(k,k);                                     (x is the pivot value)
    for i = (k + 1) : m                             (pivot column division)
        A(i,k) = A(i,k)/x;
    end for
    for i = (k + 1) : m                             (arithmetic in G)
        x = A(i,k);                                 (now x is the multiplier)
        for j = (k + 1) : n
            A(i,j) = A(i,j) - x × A(k,j);
        end for
    end for
end Update
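As a small check of the update step, the Python sketch below applies it to the permuted matrix of (3.18). It is an illustration only: the function name `update` mirrors the algorithm but the code is not the thesis implementation, k is 1-based as in the text, and exact fractions are assumed. The inner loop covers columns k + 1 through n only, so the multipliers just stored in column k are left untouched.

```python
from fractions import Fraction


def update(A, k):
    """Stage-k arithmetic, in place: divide beneath A(k,k), then update G."""
    m, n = len(A), len(A[0])
    pivot = A[k - 1][k - 1]
    for i in range(k, m):                 # rows k+1..m in 1-based terms
        A[i][k - 1] /= pivot              # store the multiplier
        mult = A[i][k - 1]
        for j in range(k, n):             # columns k+1..n: G only
            A[i][j] -= mult * A[k - 1][j]


A = [[Fraction(x) for x in row] for row in
     [[9, 4, 7, 2],
      [5, 6, 8, 4],
      [5, 3, 4, 2],
      [9, 8, 8, 6]]]

update(A, 1)
for row in A:
    print([str(x) for x in row])
```

The printed rows agree with (3.19).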
Algorithm 3.4 (Gauss Factorization with Partial Pivoting) Given the matrix of coefficients, $A \in \mathbb{R}^{n \times n}$, this algorithm modifies (overwrites) A with a unit lower triangular matrix (with an implicit diagonal), $L \in \mathbb{R}^{n \times n}$, and an upper (right) triangular matrix, $R \in \mathbb{R}^{n \times n}$, having nonzero diagonal elements (the pivots). The process also forms the row permutation vector, q, and the corresponding permutation matrix, $Q \in \mathbb{R}^{n \times n}$, that results from partial column pivoting for size. The algorithm gives the factorization: $Q^T A = LR$.

begin GFPP
    n = order(A)
    Q = zeros(n, n)
    for j = 1 : n
        q(j) = j;                                   (initialize q)
    end for
    for k = 1 : n                                   (the Gauss process)
        PP(A, q, k)                                 (pivoting)
        if (A(k,k) = 0)
            print "A is singular!"
            exit
        end if
        Update(A, k)                                (Update G)
    end for
    for j = 1 : n
        Q(q(j), j) = 1.0;                           (form Q)
    end for
end GFPP
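A compact Python rendering of the same idea may help in checking the claim $Q^T A = LR$. It is a sketch under stated assumptions: the names `gfpp` and `split_factors` are hypothetical, indices are 0-based, exact fractions replace floating point, and the permutation is kept as the vector q rather than as a dense matrix Q.

```python
from fractions import Fraction


def gfpp(rows):
    """Gauss factorization with partial pivoting on a copy of a square matrix."""
    A = [[Fraction(x) for x in row] for row in rows]
    n = len(A)
    q = list(range(n))                     # 0-based row permutation vector
    for k in range(n):
        s = max(range(k, n), key=lambda i: abs(A[i][k]))   # pivot row
        if A[s][k] == 0:
            raise ValueError("A is singular!")
        A[k], A[s] = A[s], A[k]
        q[k], q[s] = q[s], q[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A, q


def split_factors(M):
    """Extract unit lower triangular L and upper triangular R from M."""
    n = len(M)
    L = [[M[i][j] if j < i else Fraction(int(i == j)) for j in range(n)]
         for i in range(n)]
    R = [[M[i][j] if j >= i else Fraction(0) for j in range(n)]
         for i in range(n)]
    return L, R


A = [[2, 3, 4, 5],
     [4, 6, 8, 5],
     [2, 4, 7, 9],
     [6, 8, 8, 9]]

factored, q = gfpp(A)
L, R = split_factors(factored)
LR = [[sum(L[i][k] * R[k][j] for k in range(4)) for j in range(4)]
      for i in range(4)]
QtA = [A[qi] for qi in q]      # row i of Q^T A is row q[i] of A
print(LR == QtA, q)
```

Because whole rows are swapped, the multipliers already stored in earlier columns travel with their rows, which is what keeps the factorization consistent.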
Algorithm 3.5 (Gauss Factorization with Complete Pivoting) Given a matrix of coefficients, $A \in \mathbb{R}^{m \times n}$, the following algorithm modifies (overwrites) A with a unit lower trapezoidal matrix (with implicit diagonal), $L \in \mathbb{R}^{m \times r}$, and an upper (right) trapezoidal matrix, $R \in \mathbb{R}^{r \times n}$. The diagonal elements of R are nonzero (pivots). The process forms permutation matrices, $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{m \times m}$, to reflect the complete pivoting for size. These matrices are formed to satisfy the LR Theorem: $Q^T A P = LR$.


begin GFPC
    m = rows(A);       n = cols(A);                 (initialization)
    P = zeros(n,n);       Q = zeros(m,m);
    for j = 1 : n
        p(j) = j;
    end for
    for i = 1 : m
        q(i) = i;
    end for
    for k = 1 : min(m,n)                            (the Gauss process)
        PC(A, p, q, k)                              (pivoting)
        if (A(k,k) = 0)
            print "A is singular!"
            exit
        end if
        Update(A, k)                                (Update G)
    end for
    for j = 1 : n                                   (form P)
        P(p(j), j) = 1.0;
    end for
    for j = 1 : m                                   (form Q)
        Q(q(j), j) = 1.0;
    end for
end GFPC
H.   CONJUGATE GRADIENTS

Time permits only a brief synopsis of the method of conjugate gradients (CG). This method was described by Magnus R. Hestenes and Eduard Stiefel [Ref. 18]. CG possesses some very nice characteristics and it is quite different from the Gauss method. Once again, we begin with a system of linear equations

$$Au = b. \tag{3.56}$$

The algorithm given by Hestenes and Stiefel is designed for $A \in \mathbb{R}^{n \times n}$ symmetric and positive definite (Appendix A). Let $s \in \mathbb{R}^n$ be the vector that would solve (3.56) exactly, so that As = b. Let $u_i \in \mathbb{R}^n$ be the estimate of the solution, s, produced in the ith iteration. The original estimate, $u_0$, is merely a guess (it may be a good guess). For instance, in the absence of better information, we could choose $u_0$ to be the vector of all zeros or all ones.

The CG process takes our initial guess and develops a (guaranteed) better estimate for the next stage. To measure the progress, we could use the residual vector

$$r_i = b - Au_i \tag{3.57}$$

but Hestenes and Stiefel warn that its Euclidean norm, $\|r_i\|_2$, may actually increase in every step but the last! A more reliable measure, called the error vector

$$e_i = s - u_i \tag{3.58}$$

has monotonically decreasing length. After n iterations of the CG process, we are guaranteed to have a very good estimate $u_n$ of s. In fact, if no rounding errors occur, we have $u_n = s$. In practice, CG can find a very good estimate, $u_m$, of s in m iterations, with $m \ll n$. The process "terminates in at most n steps if no rounding-off errors are encountered." [Ref. 18: p. 410]
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The  algorithm  below  is  adopted  from  Hestenes  and  Stiefel  [Ref.  18].  Before 
considering  the  algorithm,  however,  we  should  define  the  key  term,  conjugate.  For 
A  symmetric,  two  vectors  x  G  3ft"  and  y  G  3£n  are  said  to  be  A-orthogonal  (or 
conjugate)  if  the  relation  x7 Ay  =  (Ax)  y  =  0  holds  [Ref.  18  :  p.  410].  This  is 
an  extension  of  vector  orthogonality,  xTy  =  0.  The  algorithm  given  below  is  very 
simple.  The  iteration  blindly  proceeds  from  i  =  0  to  i  —  n.  A  more  sophisticated 
(finite  precision)  scheme  would  set  a  tolerance  (notion  of  "good  enough")  and  stop 
(exit  the  loop)  when  this  criterion  was  satisfied. 

Algorithm  3.6  (The  Method  of  Conjugate  Gradients)  Given  the  symmetric, 
positive  definite  matrix  of  coefficients,  A  G  3ftnx";  and  an  initial  guess,  Uq;  for  the 
solution,  s;  of  the  system  Au  =  b,  this  algorithm  (in  the  absence  of  rounding-off 
errors)  finds  vt  =  s  in  i  iterations  (i  <  n).  The  algorithm  keeps  track  of  a  residual 
vector,  n,  and  direction  vectors,  pt.  The  residuals,  r,,  are  mutually  orthogonal  and 
the  direction  vectors,  p,  are  mutually  conjugate  (A-orthogonal). 

begin CG

    $u_0 = \mathrm{zeros}(n)$  (arbitrary initial guess)

    $p_0 = r_0 = b - A u_0$

    for i = 0 : n

        $\delta = p_i^T A p_i$  (denominator used below)

        $\alpha_i = (p_i^T r_i) / \delta$  (scalar multiplier used below)

        $u_{i+1} = u_i + \alpha_i p_i$  (estimate of solution)

        $r_{i+1} = r_i - \alpha_i A p_i$  (residual vector)

        $\beta_i = (r_{i+1}^T r_{i+1}) / (r_i^T r_i)$

        $p_{i+1} = r_{i+1} + \beta_i p_i$  (direction vector)

    end for
end CG
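
The text preceding Algorithm 3.6 mentions a more sophisticated scheme that stops once a tolerance is met. A minimal Python sketch of that variant follows. It assumes dense list-of-rows storage and a stopping test on the residual norm; the names `conjugate_gradient`, `dot`, `matvec`, and the default tolerance are illustrative choices and do not come from the thesis code in Appendix F. The loop here runs at most n times, rather than over i = 0 : n as printed above.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def matvec(a, x):
    """Product of a dense matrix (list of rows) and a vector."""
    return [dot(row, x) for row in a]


def conjugate_gradient(a, b, tol=1e-12):
    """Solve a u = b for a symmetric positive definite matrix a."""
    n = len(b)
    u = [0.0] * n            # u_0 = zeros(n)
    r = list(b)              # r_0 = b - A u_0 = b, because u_0 = 0
    p = list(r)              # p_0 = r_0
    steps = 0
    for _ in range(n):       # at most n steps in exact arithmetic
        if math.sqrt(dot(r, r)) <= tol:   # "good enough": stop early
            break
        ap = matvec(a, p)
        delta = dot(p, ap)                # denominator
        alpha = dot(p, r) / delta         # scalar multiplier
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, ap)]
        beta = dot(r_new, r_new) / dot(r, r)
        p = [ri + beta * pi for ri, pi in zip(r_new, p)]
        r = r_new
        steps += 1
    return u, steps
```

For the 2 x 2 system with rows (4, 1) and (1, 3) and right-hand side (1, 2), the sketch reaches the solution (1/11, 7/11) in two steps, in line with the "at most n steps" property quoted above.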
I.   SUMMARY

This chapter develops the Gaussian elimination process, the Gauss factorization
process, pivoting strategies, and (briefly) the method of conjugate gradients.
Each of the corresponding algorithms possesses potential for parallel solution. A
parallel implementation of GF appears in the following chapter. Both partial and
complete pivoting are pursued, with further discussion on their implications in a
parallel environment.

IV.    PARALLEL DESIGN

Nature  is  pleased  with  simplicity,   and  affects  not  the  pomp  of  superfluous 
causes. 

—  SIR  ISAAC  NEWTON  (1642-1727) 

Sequential  algorithms  for  Gauss  factorization  (GF)  and  the  method  of  conjugate 
gradients  (CG)  are  established  in  Chapter  III.  The  goal  of  this  chapter  is  to  show 
parallel  algorithms  for  Gauss  factorization.  The  C  programs  that  implement  these 
algorithms  are  discussed  in  Chapter  V  and  listed  in  Appendix  F. 

Parallel  algorithm  design  is  a  process  that  includes  many  considerations.  The 
question  of  how  to  achieve  parallelism  is  largely  an  art  and  is  not  discussed  here. 
The method used in this research is often called a workfarm approach because the
algorithm farms out work to processors. Equivalently, it may be called a manager-worker
model. When we distribute the problem across many processors in a workfarm
style,  there  are  quite  a  number  of  issues  that  warrant  careful  consideration.  The 
concerns  associated  with  programming  a  parallel  machine — even  with  a  relatively 
simple  model  such  as  this — could  occupy  volumes. 

Communications,  load  balancing,  granularity,  and  other  considerations  abound. 
Metrics  like  speedup  and  efficiency  should  be  used  to  lend  credibility  to  the  parallel 
nature of the algorithm. Additionally, we should consider the usual issues of maintainability,
readability, portability, and other traits commonly associated with good
(sequential)  programming  practice.  Parallel  codes  must  be  clear  combinations  of 
sequential  codes  that  are  joined  together  in  a  logical  manner.  Simplicity  should  hold 
a  place  of  great  esteem  in  a  parallel  algorithm.  The  rest  of  this  chapter  introduces 
the issues of parallel design, particularly as they pertain to Gauss factorization.

A.  INTERPROCESSOR COMMUNICATIONS

Interprocessor communication is one of the most fundamental issues in parallel
processing and, quite possibly, the most involved. Without a means of communicating
(in a message-passing environment), the multiprocessor system is meaningless.
The  implications  of  any  communications  scheme  are  many  and  the  interactions  can 
be  quite  complex.  Exhaustive  coverage  of  this  issue  is  out  of  the  question,  so  we  will 
consider  a  few  of  the  most  essential  ideas. 

1.   The  Network 

A  network  is  the  part  of  a  multiprocessor  system's  hardware  that  bears 
the  interprocessor  communications  burden.  It  is  a  combination  of  nodes  and  links 
that  connect  those  nodes,  and  it  is  the  foundation  upon  which  all  communications 
must  build.  We  will  also  refer  to  the  nodes  of  a  multiprocessor — using  somewhat 
loose  terminology — as  processors.  The  term  node  is  a  more  general  term.  Nodes 
are  typically  more  sophisticated  than  a  simple  central  processing  unit  (CPU)  or,  for 
that  matter,  any  other  sort  of  processor.  The  link  is  a  wire  that  connects  two  nodes. 
An  interconnection  topology  describes  the  pattern  of  links  used  to  connect  the  nodes 
of  a  network.  The  network  can  be  drawn  or  illustrated  so  that  we  can  see  how  its 
nodes  are  connected.  Appendix  C  discusses  interconnection  topologies  and  it  gives 
a  description  (and  illustrations)  of  the  particular  scheme  used  in  this  research:  the 
hypercube. 

Intel combines an 80386 CPU with an 80387 math coprocessor and communications
facilities to form a "CX" node for the iPSC/2 that was used in this research.
INMOS provides the same general capabilities but packages them all on a (very sophisticated)
single chip, called a transputer. Figure 4.1, from INMOS' T9000 Transputer
Products Overview Manual [Ref. 25: p. 31], shows a high-level block diagram of the
components of a T9000 transputer. Thus, any node of a message-passing multiprocessor
system can be thought of as a combination of computing and communications
facilities. It may possess other capabilities as well.

2.    Message  Routing 

The  machines  used  in  this  research  exhibit  different  message  transmission 
schemes.  The  transputer  system  employs  high-speed  (20  megabits  per  second)  point- 
to-point  serial  communications  and  store-and-forward  message  passing.  That  is,  for 
multi-hop  communications,  each  node  along  the  way  must  receive  the  message,  store 
it  in  local  memory  temporarily,  and  then  pass  it  to  the  next  node  in  the  route. 

The  Intel  iPSC/2  uses  another  technique,  called  circuit  switching  or  direct- 
connect  communications.  This  approach  is  much  like  our  telephone  system.  First, 
the  originator  of  the  message  sends  a  small  message  containing  information  about 
the  message  (e.g.,  destination  node  number,  length  of  message)  to  the  destination 
via the nodes in-between. As this small header packet makes its way to the destination,
the nodes along the way flip switches, closing a circuit from the sender to the
receiver.  Once  this  circuit  is  established,  the  message  proceeds  from  the  sender  to 
the  destination  without  interruption. 

Each  method  has  its  advantages  and  disadvantages.  The  circuit  switching 
approach  allows  for  fewer  interruptions  along  the  way,  but  it  ties  up  the  entire  path 
for  the  duration  of  the  communication.  The  store-and-forward  method  imposes 
delays  for  storing  the  message  into,  and  then  retrieving  it  from,  the  memory  of  every 
node  along  the  way.  (A  more  complete  description  of  these  two  techniques,  together 
with  experimental  results,  is  given  in  Appendix  B).  For  the  algorithms  employed  in 
this  research,  almost  all  communications  were  "nearest  neighbor"  in  the  hypercube. 
In this case, the difference between the two approaches to message routing is insignificant
and the nearest neighbor performance becomes more important.
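
The trade-off between the two routing schemes can be sketched with a deliberately simplified cost model. The model below is an assumption made for illustration, not a result from Appendix B: store-and-forward pays the full transfer time on every hop, while circuit switching pays a small header cost per hop and then one uninterrupted transfer. The function names, parameters, and the numbers in the usage note are hypothetical.

```python
def store_and_forward_time(nbytes, hops, byte_time, per_hop_overhead=0.0):
    """Every intermediate node receives, stores, and resends the message."""
    return hops * (per_hop_overhead + nbytes * byte_time)


def circuit_switched_time(nbytes, hops, byte_time, header_time):
    """A header closes the circuit hop by hop, then the data flows once."""
    return hops * header_time + nbytes * byte_time
```

With a hypothetical 1000-byte message, 0.5 time units per byte, and 10 time units per header hop, the model gives 500 versus 510 time units for a single hop, but 2000 versus 540 for four hops. Under these assumptions the two schemes nearly coincide for nearest-neighbor traffic, which is the case that matters for the algorithms considered here.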
Figure 4.1: IMS T9000 Block Diagram

3.    Concurrent Computing and Communicating

The  nodes  of  a  multiprocessor  machine  should  be  able  to  both  compute 
and  communicate  efficiently  and  concurrently.  This  is  no  small  undertaking.  The 
computing  side  must  access  memory  to  accomplish  its  mission,  but  the  message- 
passing begins by drawing data out of memory and ends by storing data into memory.
Therefore, at a minimum, we have competition related to memory accesses.
Furthermore, the computing and communication must be synchronized to some extent.
The algorithms used in this research used blocking communications (described
in Appendix E), which enforce synchronization.

There are overheads associated with communications and this synchronization
problem. Bryant showed how transputers perform under various communication
loads [Ref. 26] and this is mentioned in Appendix E. The issue of overheads
is one that Charles Seitz considered for the "Cosmic Cube." Much, but not all, of
the overhead is communication-related. Seitz listed three of the major problems
[Ref. 27: p. 28]:

(1) the idle time that results from imperfect load balancing, (2) the waiting
time caused by communications latencies in the channels and in the message
forwarding, and (3) the processor time dedicated to processing and forwarding messages,
a consideration that can be effectively eliminated by architectural improvements
in the nodes.

Included  in  these  costs,  we  should  also  recognize  that  some  amount  of  time  is  required 
for  the  processor  to  perform  "context  switching"  (changing  jobs)  and/or  coordination 
with  a  special-purpose  processor  that  we  might  call  the  communications  manager. 

Although  the  issue  of  concurrent  communication  and  computing  is  a  very 
complex  one,  we  may  consider  significant  issues  that  are  related  to  the  efficiency  of 
communications  and  the  effect  upon  the  processor.  Geoffrey  Fox  presents  the  notion 
of comparing communications ability to processing ability [Ref. 28: pp. 50-51]. Let
$t_{calc}$ be "the typical time required to perform a generic calculation. For scientific
problems, this can be taken as a floating-point calculation $a = b \times c$ or $a = b + c$."
Furthermore, let $t_{comm}$ be "the typical time taken to communicate a single word
between two nodes connected in the hardware topology." Then the ratio

$$\frac{t_{comm}}{t_{calc}}$$

is  a  general  characteristic  of  a  particular  system  that  can  be  quite  useful  in  comparing 
machines.  Fox  uses  this  ratio  in  much  of  the  rest  of  his  work. 

A parallel machine must necessarily possess a capable communications subsystem,
but this is not enough. The program should also make prudent use of the
communications facilities. This means that the programmer and/or compiler must
exhibit a good understanding of the machine's communications abilities and weaknesses.
Some characteristics are nearly universal. Most machines, for instance,
reward the use of long messages because there is an overhead (nearly independent
of message length in many cases) to sending any message. Other characteristics are
very much machine-dependent. This means that the programmer should be relatively
familiar with the communications abilities and characteristics of the target
machine.
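
The advantage of long messages can be illustrated with a simple linear cost model, in which each message costs a fixed startup overhead plus a per-word transfer time. This is a sketch under that assumption; the function name and the numbers in the usage note are hypothetical and are not measurements from either machine.

```python
def message_time(words, startup, per_word):
    """Time for one message: fixed overhead plus a length-dependent part."""
    return startup + words * per_word


def total_time(words, messages, startup, per_word):
    """Time to move `words` words split evenly over `messages` messages."""
    return messages * message_time(words / messages, startup, per_word)
```

With a hypothetical startup cost of 350 time units and 1 time unit per word, sending 1000 words as one message costs 1350 units, while sending them as 1000 one-word messages costs 351000 units. The startup term dominates as soon as messages become short.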

4.    Accessing  the  Clock 

The  ability  to  accurately  measure  the  time  required  by  communications 
and  computations,  preferably  at  the  host  and  every  node  in  the  system,  is  absolutely 
essential  in  a  multiprocessor  environment.  Profiling,  in  a  sequential  program,  allows 
us  to  compare  the  time  required  by  various  parts  of  a  program.  Timing  in  a  parallel 
environment allows us to profile the code. Thus we can determine the time required for
instructions,  loops,  functions,  or  communications. 

Profiling  is  an  even  more  important  practice  for  parallel  coding  than  it  is  in 
the sequential case. The only way for a parallel program to be useful is if it
can be implemented efficiently upon an acceptable number of processors. That is,
in  general,  the  only  object  in  choosing  a  multiprocessor  system  over  a  sequential 
machine  is  the  speed  with  which  computation  can  be  performed.  One  of  the  best 
tools  available  to  the  parallel  programmer  is  the  ability  to  see  where  and  how  much 
time  is  being  spent. 

At a minimum, we need the ability to sample a clock with reasonable precision.
Both machines and compilers used in this research provide this capability (see
timing.h in Appendix F for details). The transputers offer a choice of frequencies:
the  clock  associated  with  low  priority  processes  has  a  period  of  64  microseconds  and 
the  high  priority  clock  offers  one  microsecond  ticks.  The  iPSC/2  mclock()  function 
gives  time  in  milliseconds. 

B.   METRICS  FOR  PARALLEL  COMPUTING 
1.    Complexity 

Perhaps  the  most  obvious  measures  for  a  parallel  algorithm  are  simply 
those  that  we  use  for  sequential  algorithms.  We  want  to  keep  time  and  storage 
requirements  to  a  minimum.  Perhaps  the  major  difference  in  complexity  analysis 
for  a  parallel  algorithm  is  that  we  are  primarily  interested  in  a  per-processor  notion 
of  complexity.  If  the  problem  has  been  farmed  out  in  a  fair  manner,  complexity 
analysis  for  the  parallel  case  is  merely  an  extension  of  the  sequential  case. 

Consider the matrix $A \in \mathbb{R}^{n \times n}$. Suppose that its elements are 8-byte,
double-precision, floating-point values (type double in C). Let $M_p$ denote the
memory (in bytes) required at each of p processors to store A and let $T_p$ denote the time
required for p processors to solve the system characterized by A. Then $M_1 = 8n^2$
bytes of storage, but (ideally) $M_8 = n^2$. When the problem is distributed across p
processors simultaneously, the processors can share the storage burden.

Exceptions abound. For certain problems, it may actually be convenient
(faster or more reliable) to store the entire matrix at each processor. Nevertheless,
in most cases we would like to minimize local memory requirements. The Gauss
factorization algorithm considered near the end of this chapter is no exception. Indeed,
the transputers used in this work had only 32 kilobytes of storage each and
the  results  of  Chapter  VI  for  transputers  show  how  this  can  dictate  the  size  of  the 
problem  that  can  be  executed.  The  concepts  of  time  and  storage  complexity  have 
been  developed  in  detail  for  sequential  algorithms  and  they  seem  to  hold  a  place  in 
parallel  algorithm  assessment  as  well.  We  consider  other  measures  that  have  been 
developed  for  parallel  computing  in  the  following  section. 
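
The storage argument can be made concrete with a short calculation. The sketch below assumes a column-wise distribution in which each node holds at most ceil(n/p) columns of 8-byte values, and it assumes (optimistically) that all of a node's memory is available for matrix storage, ignoring code, stack, and workspace. The function names are illustrative.

```python
def bytes_per_node(n, p, word=8):
    """Storage at the busiest node when n columns are dealt to p nodes."""
    columns = -(-n // p)          # ceil(n / p) columns of n words each
    return word * n * columns


def largest_order(p, node_bytes, word=8):
    """Largest n whose share of an n-by-n matrix fits in node_bytes."""
    n = 0
    while bytes_per_node(n + 1, p, word) <= node_bytes:
        n += 1
    return n
```

Under these assumptions a single node with 32 kilobytes can hold a matrix of order at most 64, while eight such nodes can hold one of order 178. The limit on problem size discussed above follows directly from the per-node memory.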

2.    Contemporary  Measures 

The concepts of speedup and efficiency (Appendix A) are two of the most
common performance measures currently associated with parallel computing, with
the ideal case (100% efficiency) yielding $t_p = t_1/P$ on a P-processor system. Selim
Akl proposes the following criteria for analyzing algorithms [Ref. 29: pp. 21-28]:

• Running Time: Running time t(n) is the time required to execute an algorithm
for a problem of input size n. Akl lists three ways to express this
notion. First, we may count the steps in an algorithm. Akl distinguishes between
computational steps (i.e., something like flops) and routing steps that
are associated with interprocessor communication. Second, we have lower and
upper bounds (e.g., the complexity notation presented in Appendix A). Finally,
we have speedup. Akl gives the usual definition of speedup but clarifies
it somewhat (details below).

• Number of Processors: Second in importance, Akl considers the number of
processors required by an algorithm. He uses p(n) to denote the number of
processors required for a problem of size n.

• Cost: Akl defines the cost, c(n), for a parallel algorithm as the product of the
first two factors. That is, $c(n) = t(n) \times p(n)$.

• Other Measures: In this category, we have no fewer than three other qualities
of a parallel system that deserve consideration. The area (i.e., chip real estate)
required by the processors is significant. The length of the links, as well as
any patterns (regularity and modularity), figures in. And finally, the period
between processing different elements of an input is important.

Apparently  metrics  for  parallel  computing  are  still  developing.  There  are  several 
very  useful  concepts  such  as  speedup  and  efficiency.  The  definition  of  speedup,  at  a 
first  glance,  is  rather  standard.  It  doesn't  take  much  probing,  however,  to  find  that 
different  authors  make  different  assumptions.  Akl  defines  speedup  S  in  the  usual 
manner, 

$$S = \frac{t_1}{t_p} \qquad (4.1)$$

except that he is somewhat more specific about the times. He defines $t_1$ as the
"worst-case running time of fastest known sequential algorithm for problem" and $t_p$
as "worst-case running time of parallel algorithm" [Ref. 29: p. 24]. He has been
more specific than most authors, but it seems likely that the algorithms, method of
obtaining times $t_1$ and $t_p$, and systems should also be specified. Speedup is defined
loosely  in  most  cases.  A  parameterization  to  accompany  speedup  would  be  tedious, 
but  useful.  Until  speedup  becomes  a  standard  term  with  accepted  meaning,  we  shall 
have  to  specify  exactly  what  it  means.  We  should  be  more  careful  with  this  term. 
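
The measures in this section reduce to a few lines of arithmetic. The sketch below takes speedup as the ratio of sequential to parallel time, Akl's cost as time multiplied by processor count, and efficiency as speedup divided by the number of processors. The efficiency formula is assumed here to match the definition in Appendix A, since it agrees with the ideal case $t_p = t_1/P$ stated above. How the two times are obtained is left to the caller, which is exactly the ambiguity the text warns about.

```python
def speedup(t1, tp):
    """Ratio of sequential running time t1 to parallel running time tp."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Speedup per processor; 1.0 corresponds to tp = t1 / p."""
    return speedup(t1, tp) / p


def cost(tp, p):
    """Akl's cost: parallel running time times number of processors."""
    return tp * p
```

For hypothetical times of 80 seconds sequentially and 12.5 seconds on 8 processors, the speedup is 6.4, the efficiency is 0.8, and the cost is 100 processor-seconds.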
3.    Other Ideas

Akl has appropriately distinguished between computational steps and routing
steps. The term floating-point operations (flops) has become quite popular (along
with  benchmarks)  and  this  is  a  useful  means  of  expressing  the  computational  ability 
of  a  machine  (for  floating-point  applications).  The  notion  of  routing,  however,  is 
somewhat  vague.  Nevertheless,  this  idea  must  be  addressed.  It  should  probably 
become  more  specific  as  we  talk  about  similar  machines. 

The  machines  used  for  this  research  were  MIMD  message-passing  systems. 
We  can  get  much  more  specific  about  "routing  steps"  for  such  a  machine.  First,  using 
the  clock  as  a  stopwatch,  we  can  profile  any  segment  of  code  (including  calculations 
and/or communications). An implementation-specific version of Fox's $t_{comm}/t_{calc}$
ratio can be instructive. It is important to apply this ratio to the hardware as Fox
defines  it,  but  it  is  equally  important  to  recognize  the  role  of  the  software  (algorithm). 
That  is,  for  some  specific  implementation,  we  should  be  interested  in  finding  some 
measure  of  how  much  time  is  spent  communicating  and  how  much  time  is  spent 
computing.  More  specifically,  a  careful  profile  could  be  made  of  a  program  in  the 
following  manner. 

The  ratio  of  cumulative  (i.e.,  over  the  execution  of  the  entire  program)  time 
spent  communicating  to  time  spent  computing  should  be  considered  as  a  first  cut, 
especially  if  performance  (efficiency)  is  weak.  Algorithms  such  as  Gauss  factorization 
are executed in stages, within a loop of some sort. In this case, the $t_{comm}/t_{calc}$
ratio per iteration is an interesting figure (and, if the loop represents most of the
program's execution time, this should be approximately equal to the cumulative
figure). 
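
The cumulative ratio described above can be collected with a stopwatch-style wrapper. The following sketch is illustrative only: the class name `StageProfile`, its methods, and the two category labels are hypothetical, and `time.perf_counter` stands in for the machine-specific clocks mentioned earlier in this chapter.

```python
import time


class StageProfile:
    """Accumulate time spent communicating and computing."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock                      # any zero-argument timer
        self.totals = {"comm": 0.0, "calc": 0.0}

    def timed(self, kind, func, *args):
        """Run func(*args) and charge its elapsed time to `kind`."""
        start = self.clock()
        result = func(*args)
        self.totals[kind] += self.clock() - start
        return result

    def ratio(self):
        """Cumulative communication time divided by computation time."""
        return self.totals["comm"] / self.totals["calc"]
```

Wrapping each send, receive, and arithmetic phase of the main loop in `timed` yields both the per-stage and the cumulative figures. The clock is passed in as a parameter so that a different timer can be substituted without changing the profiling code.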

When possible, communications complexities should be analyzed
carefully. For instance, in the Gauss factorization code that is presented in
Appendix F, a C structure is used to relay the owner (node id) of a pivot and the
pivot's row, column, and value. This structure is 20 bytes of data and we know
the  pattern  with  which  these  structures  are  moved  about  during  the  course  of  the 
program.  It  is  important  to  quantify  communication  like  this  when  possible.  The 
vague  notation  should  lose  significance  in  the  presence  of  such  concrete  information. 
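
The 20-byte figure is consistent with three 4-byte integers (owner, row, column) followed by one 8-byte double (value), with no padding. That field layout is an assumption made for this sketch, since the structure itself appears only in Appendix F. Python's `struct` module can confirm the arithmetic:

```python
import struct

# Assumed layout: owner, row, column as 4-byte ints, value as 8-byte double.
# "<" selects standard sizes with no alignment padding.
PIVOT_FORMAT = "<iiid"


def pack_pivot(owner, row, column, value):
    """Encode one pivot record as a byte string."""
    return struct.pack(PIVOT_FORMAT, owner, row, column, value)


def unpack_pivot(message):
    """Decode a pivot record into (owner, row, column, value)."""
    return struct.unpack(PIVOT_FORMAT, message)
```

`struct.calcsize(PIVOT_FORMAT)` returns 20 for this packed layout. A compiler that aligns doubles on 8-byte boundaries would pad the same fields to a larger size, so the size of the real structure depends on the target compiler.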

There  are  other  important  and  related  ideas.  The  frequency  and  volume 
of  communications  traffic  is  easy  to  determine  with  a  high  degree  of  accuracy  for 
algorithms  such  as  Gauss  factorization.  Once  again,  in  the  presence  of  this  kind 
of  information,  we  should  dispense  with  vague  concepts.  It  is  useful  to  consider 
something  like  a  pie  chart  showing  the  various  amounts  of  time  spent  on  each  portion 
of  the  major  loop  in  a  program.  Indeed,  this  was  a  part  of  the  development  of  the 
Gauss  code  given  in  this  thesis.  Tools  such  as  these  are  important  in  refining  parallel 
algorithms  and  streamlining  code. 

The  parallel  program  designer  must  consider  many  other  issues  regarding 
communications.  Graph  theory  notation  is  a  natural  tool.  A  link-by-link  analysis 
of the communications over the course of a program is not out of the question (especially
if the communication is merely a repetition of very simple messages). Efficient
use  of  the  topology  is  important.  We  should  consider  the  percentage  of  links  used, 
balancing  of  the  communications  load,  frequency  of  traffic  for  each  link  (often  the 
communication  comes  in  bursts  and  often  between  iterations  of  the  basic  algorithm), 
flow  rate  (in  bytes  per  second)  for  each  link  during  the  bursts  or  over  longer  periods 
of time, timelines showing dependencies, and other specific characteristics of communications.
Analysis should be done on a per-stage basis for algorithms that exhibit
iteration  (loops). 

Perhaps  most  importantly,  a  plan  for  interprocessor  communication  should 
begin well in advance, before the code is ever written. A reactive approach is necessary,
like debugging code. But a proactive, strong design effort can simplify matters.
The notion of communicating sequential processes (CSP) deserves attention. This
model is due to C. A. R. Hoare [Ref. 30], and it is never far away in the world of transputers.
There is a very close relationship between transputers, occam (their native
language),  and  CSP.  CSP  is  a  useful  paradigm  for  this  sort  of  (message-passing) 
machine.  When  possible,  a  problem  should  be  logically  separated  into  processes. 
The  division  of  the  problem  should  be  natural,  so  that  every  process  represents  a 
logical  group  of  tasks.  The  processes  are  allowed  channels  to  communicate,  and  these 
channels  are  implemented  as  either  links  in  hardware  or  buffers  in  memory  if,  for 
instance,  two  processes  on  the  same  processor  wanted  to  communicate. 

If  a  problem  is  designed  correctly,  we  should  have  substantial  amounts  of 
work  within  a  process  and  minimal  interprocess  communication.  If  the  processes  and 
channels  are  represented  as  the  nodes  and  edges  of  a  directed  graph,  we  can  make 
use  of  some  nice  tools  and  theorems  from  graph  theory.  For  instance,  we  should  like 
to  maximize  computation  and  minimize  communications.  One  natural  method  is  to 
begin  with  atomic  processes  and  start  to  build. 

Suppose  that  we  have  many  such  processes  (at  least  as  many  as  processors) 
and  we  represent  them  as  the  nodes  of  a  directed  graph.  We  can  assign  the  processes 
(nodes)  a  weight  that  reflects  some  form  of  computational  difficulty.  This  should  be 
a  fairly  concrete  number,  assuming  that  the  task  (process)  is  well-defined.  It  might 
be  the  number  of  flops  per  iteration,  for  example.  Next,  the  channels  should  be 
clearly  indicated  as  weighted,  directed  edges.  The  weight  should  usually  be  a  very 
concrete  number  as  well,  like  the  number  of  bytes  that  passes  along  that  channel 
between  each  stage  of  a  computation. 

This  model  gives  the  problem  the  sort  of  order  that  is  necessary  to  keep 
the  parallel  design  simple,  logical,  and  formal  (i.e.,  friendly  for  proof  of  program 
correctness).  Once  the  problem  has  been  expressed  in  such  a  manner,  there  are 
many options. For example, we could consider minimum cuts of the flow rates to
decide how to efficiently apportion processes to processors. This mapping alone could
greatly  enhance  the  performance  of  code. 
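
The weighted process graph described above can be evaluated for any candidate mapping with very little code. The sketch below is an illustration with hypothetical names: `work` maps each process to its computational weight (for example, flops per iteration), `traffic` maps each directed channel to its weight (for example, bytes per stage), and `placement` assigns processes to processors. It scores a given mapping; it does not search for a minimum cut.

```python
def evaluate_mapping(work, traffic, placement):
    """Return (load per processor, bytes crossing between processors)."""
    load = {}
    for process, flops in work.items():
        node = placement[process]
        load[node] = load.get(node, 0) + flops
    # Only channels whose endpoints sit on different processors use a link.
    cut = sum(nbytes for (src, dst), nbytes in traffic.items()
              if placement[src] != placement[dst])
    return load, cut
```

Comparing the load dictionary and the cut weight for several placements shows the tension between balancing computation and limiting link traffic: a placement that keeps heavily communicating processes together lowers the cut but may unbalance the load.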

It  seems  that  much  of  the  work  in  this  area  is  rather  imprecise  and  generally 
unacceptable.  Granted,  parallel  design  methodology  is  a  relatively  recent  problem 
but  it  can  be  improved  substantially.  Good  parallel  designs  that  consider  these  kinds 
of  issues  and  express  them  clearly  will  likely  be  in  high  demand  as  parallel  computing 
machinery  develops. 

C.   PARALLEL  METHODS 

The wide-ranging capabilities of contemporary computing machinery are evident.
An exhaustive list would demand pages, but most readers could readily name
several applications that bear little resemblance to each other. For a single, very specific
machine there is almost no limit to the combinations of sequential instructions
that  it  may  carry  out.  Put  another  way,  a  particular  machine  can  be  designed  and 
built  in  a  few  months  or  years  depending  upon  the  level  of  sophistication  involved. 
But  the  different  types  and  purposes  of  software  that  may  be  created  to  run  on  that 
single  machine  are  nearly  limitless.  Consider  Householder's  comments  on  the  art  of 
computation  [Ref.  17:    p.  1]: 

If  a  computation  requires  more  than  a  very  few  operations,  there  are  usually 
many  different  possible  routines  for  achieving  the  same  end  result.  Even  so  simple 
a  computation  as  ab/c  can  be  done  (ab)/c,  (a/c)b,  or  a(b/c),  not  to  mention  the 
possibility of reversing the order of the factors in the multiplication. Mathematically
these are all equivalent; computationally they are not (cf. §1.2 and §1.4).
Various, and sometimes conflicting, criteria must be applied in the final selection
of a particular routine. If the routine must be given to someone else, or to a computing
machine, it is desirable to have a routine in which the steps are easily laid
out, and this is a serious and important consideration in the use of sequenced computing
machines. Naturally one would like the routine to be as short as possible,
to  be  self-checking  as  far  as  possible,  to  give  results  that  are  at  least  as  accurate  as 
may  be  required.  And  with  reference  to  the  last  point,  one  would  like  the  routine  to 
be  such  that  it  is  possible  to  assert  with  confidence  (better  yet,  with  certainty)  and 
in advance that the results will be as accurate as may be desired, or if an advance
assessment is out of the question, as it often is, one would hope that it can be made
at least upon completion of the computation.

-  ALSTON  S.  HOUSEHOLDER 

Parallel  algorithms  are  combinations  of  sequential  ones,  so  their  complexity 
can  grow  quickly.  In  general,  the  hardware  issues  surrounding  parallel  problems 
are  mature  and  straightforward.  Software,  on  the  other  hand,  is  developing  and 
generally  difficult  to  use. 

In  addition  to  the  familiar  design  considerations  for  a  straightforward  sequential 
algorithm,  the  design  of  a  parallel  solution  must  specify: 

• An awareness of the interaction between processing and communication. Frequency
and duration (message length) of communications should be known, if
possible. Additionally, we should know how this compares to the frequency
and duration (flops) of computing work.

• A plan for interprocessor communication, including hardware and software.

•  A  scheme  for  memory  usage. 

•  The  granularity  of  the  problem  (i.e.,  should  the  processors  be  given  larger  or 
smaller  "chunks"  of  work  at  a  time). 

•  Load  balancing  among  several  processors. 

•  A  method  for  accessing  input/output  resources. 

This is a very high level look at the problem. The issue of communications alone
can be more than half of the problem. The simplicity of this short list does not do
the  problem  justice.  Correct  execution,  as  in  the  sequential  case,  is  very  important. 
But  parallel  algorithms  are  subject  to  the  added  scrutiny  of  performance  data  (e.g., 
efficiency).

The methodology for constructing parallel algorithms is a very creative process,
and  there  are  many  questions  that  can  be  asked.  Is  a  highly  efficient  parallel  solution 
possible,  or  is  the  problem  bound  by  dependencies  and  sequential  work?  What  is 
the  ratio  of  time  spent  communicating  to  time  spent  computing?  How  nearly  does 
a  given  algorithm  approach  the  optimal  solution?  What  would  happen  on  some 
other  number  of  processors?  Are  there  any  bottlenecks  that  can  be  eliminated? 
Nevertheless, the current performance of parallel machines and the promise of future
architectures are more than adequate motivation to continue developing these
products.

D.   ALGORITHMS 

With the preceding concerns in mind, let us consider the algorithm for Gauss
factorization that was used in this work. The algorithm is given at a very high
level because detail can be gleaned from Chapter V and from the actual code in Appendix F.
The first consideration for GF was "How should the work be distributed?"
There  are  many  options.  The  matrix  could  be  distributed  by  rows,  or  columns,  or 
blocks.  The  method  chosen  in  this  case  was  a  distribution  of  the  columns  of  A  across 
the  nodes  of  the  machine.  The  columns  were  distributed  so  that  column  j  went  to 
processor  number  j  (mod  P)  in  a  P-processor  network. 

Such a distribution scheme seems natural for several reasons. First, the work
associated with the Gauss process moves toward the lower right-hand corner of the
matrix $A \in \mathbb{R}^{n \times n}$. By using a modulus assignment, and assuming that $n \gg P$, we
have  a  situation  where  the  load  on  the  processors  is  nearly  balanced  for  most  of  the 
process.  Second,  a  column-oriented  assignment  places  the  pivot  column  on  a  single 
node  at  each  stage.  This  makes  division  by  the  pivot  value  a  simple  task.  It  is 
interesting  to  note  that  a  similar  distribution  of  A  by  rows  would  have  merit  as  well. 
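The balance claim can be checked with a small counting sketch. It assumes that at stage r the columns still needing work are those with index j > r; the names active_columns and column_imbalance are hypothetical and do not come from the thesis code.

```c
#include <assert.h>

/* Columns on node N that still need updating at stage r: j > r with j mod P == N. */
int active_columns(int n, int P, int N, int r)
{
    assert(P > 0 && N >= 0 && N < P);
    int count = 0;
    for (int j = r + 1; j < n; j++)
        if (j % P == N)
            count++;
    return count;
}

/* Largest minus smallest active-column count over all nodes at stage r. */
int column_imbalance(int n, int P, int r)
{
    int lo = active_columns(n, P, 0, r);
    int hi = lo;
    for (int N = 1; N < P; N++) {
        int c = active_columns(n, P, N, r);
        if (c < lo) lo = c;
        if (c > hi) hi = c;
    }
    return hi - lo;
}
```

Under this assumption the counts on any two nodes never differ by more than one column, at any stage. A contiguous block assignment, by contrast, would leave the nodes holding the leftmost columns idle as the work moves to the right.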
Once the matrix has been distributed, the code simply moves, in a synchronized
fashion, from stage to stage of Gauss. At each stage, we must pivot according to
some strategy. Complete pivoting showed especially poor performance since it
involved a great deal of communication and synchronization between stages. The
partial pivoting method allows us to determine in advance which node will have the
pivot, and much less communication is required when this node simply broadcasts the
pivot and pivot column. After the pivot node divides every element under the pivot by
the pivot value, it broadcasts the entire pivot column to every other processor. When the
processors obtain the pivot column, they use the multipliers to perform arithmetic
in the Gauss transform area, and then proceed to the next stage.
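The work done in one stage can be sketched serially. The function below is a minimal single-processor illustration under stated assumptions, not the code of Appendix F: the matrix is square and stored by columns, the row interchange is applied to every column, and the name gf_stage is hypothetical. In the parallel version each node would apply the interchange and the update only to the columns it holds.

```c
#include <assert.h>

static double abs_value(double x) { return x < 0.0 ? -x : x; }

/*
 * One stage r of Gauss factorization with partial pivoting on an n-by-n
 * matrix stored by columns: element (i, j) lives at a[j * n + i].
 * Returns the pivot row index s, or -1 if the pivot column is all zeros.
 */
int gf_stage(double *a, int n, int r)
{
    assert(a != 0 && r >= 0 && r < n);

    /* Partial pivoting: largest magnitude on or below the diagonal of column r. */
    int s = r;
    for (int i = r + 1; i < n; i++)
        if (abs_value(a[r * n + i]) > abs_value(a[r * n + s]))
            s = i;
    if (a[r * n + s] == 0.0)
        return -1;

    /* Row interchange r <-> s in every column. */
    if (s != r)
        for (int j = 0; j < n; j++) {
            double tmp = a[j * n + r];
            a[j * n + r] = a[j * n + s];
            a[j * n + s] = tmp;
        }

    /* Pivot column arithmetic: divide the entries under the pivot by it. */
    for (int i = r + 1; i < n; i++)
        a[r * n + i] /= a[r * n + r];

    /* Gauss transform area: columns j > r, rows i > r, using the multipliers. */
    for (int j = r + 1; j < n; j++)
        for (int i = r + 1; i < n; i++)
            a[j * n + i] -= a[r * n + i] * a[j * n + r];

    return s;
}
```

Calling gf_stage for r = 0, 1, ..., n - 1 in turn leaves the multipliers below the diagonal and the upper triangular factor on and above it.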

The following algorithms give an overview of the programs that appear in
Appendix F.

Algorithm 4.1 (Parallel GF: Host) At this level, the host code is essentially the
same for both partial pivoting and complete pivoting. The program is very simple:
distribute the columns, and then accept them back one by one. Let $A \in \Re^{m \times n}$ be
the matrix of coefficients, and let P be the number of processors. This algorithm
forms the modified copy of A by overwriting the original copy. After the nth column
is returned from the nodes, we have the factored version of A that can be separated
into L and R in the usual manner.

begin GF (Host)
    for j = 0 : (n - 1)
        send A(:, j) to node (j mod P)
    end for
    for r = 0 : (n - 1)
        receive A(:, r) from node (r mod P)
    end for
end GF (Host)

Algorithm 4.2 (Parallel GFPP: Nodes) Let $A \in \Re^{m \times n}$ be the entire matrix
(held at the host). This algorithm is executed on each node in a P-processor network.
Let the node number be N and let $A_N \in \Re^{m_N \times n}$ be the local copy of select columns
of the matrix A (where $m_N \approx m/P$ is the number of columns held locally). Let $G_N$
be that part of the Gauss transform area, G, that is held locally. This node receives
every column, j, of A where (j mod P) = N.

begin GFPP (Nodes)
    for j = 0 : (m_N - 1)
        receive column and place in A_N(:, j)
    end for
    for r = 0 : (n - 1)
        if (r mod P) = N (pivot is held locally)
            perform partial pivoting
            broadcast pivot row index, s, to all nodes
            perform pivot column arithmetic
            broadcast pivot column to all nodes
        else
            receive pivot row index, s, and perform row interchanges
            receive broadcast of pivot column
        end if
        if N = 0
            send pivot column to host
        end if
        perform arithmetic in G_N
    end for
end GFPP (Nodes)

Algorithm 4.3 (Parallel GFPC: Nodes) Let $A \in \Re^{m \times n}$ be the entire matrix
(held at the host). This algorithm is executed on each node in a P-processor network.
Let the node number be N and let $A_N \in \Re^{m_N \times n}$ be the local copy of select columns
of the matrix A (where $m_N \approx m/P$ is the number of columns held locally). Let $G_N$
be that part of the Gauss transform area, G, that is held locally. This node receives
every column, j, of A where (j mod P) = N.

begin GFPC (Nodes)
    for j = 0 : (m_N - 1)
        receive column and place in A_N(:, j)
    end for
    for r = 0 : (n - 1)
        locate best (local) pivot candidate
        elect pivot (let node N_p hold the winner of the pivot election)
        if (N_p = N)
            broadcast pivot indexes, (s, t), to all nodes
            perform pivot column arithmetic
            broadcast pivot column to all nodes
        else
            receive pivot indexes, (s, t)
            perform permutations
            receive broadcast of pivot column
        end if
        if N = 0
            send pivot column to host
        end if
        perform arithmetic in G_N
    end for
end GFPC (Nodes)

V.    IMPLEMENTATION
A.   ENVIRONMENT 

Chapter  IV  introduces  parallel  algorithms  for  Gauss  factorization  (GF).  The 
GF  algorithms  are  produced  for  partial  and  complete  pivoting  strategies.  All  of 
the  programs  associated  with  this  research  are  written  in  parallel  versions  of  the  C 
language  and  executed  on  two  types  of  machines  at  the  U.  S.  Naval  Postgraduate 
School.  The  Math  Department's  iPSC/2  afforded  eight  of  Intel's  CX  type  processors 
arranged  in  a  hypercube  topology.  The  Parallel  Command  and  Decision  Systems 
(PARCDS)  Laboratory  in  the  Computer  Science  Department  has  more  than  seventy 
transputers  available  for  the  experiments.  The  discussion  below  gives  a  more  exact 
description  of  the  material  and  equipment  used  in  the  work. 

1.    Hardware 

This section describes the machines upon which the work was carried out.
A general knowledge is assumed, including familiarity with the Intel 80386
microprocessor, 80387 math coprocessor, and INMOS transputers. Some of this information
is provided in Appendix B.

The hardware used in this research represents the state of the art for the
mid-to-late 1980s. These machines are quickly becoming outdated, in keeping with the
history of computing, but both INMOS and Intel have more recent, competitive
products in today's market and fine prospects for future machines. So, while they are
a bit dated, the products used in this research represent important contemporary
parallel architectures.

Figure 5.1: Hypercube Interconnection Topology: Order $n \le 3$

a.    Networks  of  Transputers 

The majority of the research was performed upon hypercubes of order
$n \in \{0, 1, 2, 3\}$. These are the usual hypercubes (see Appendix C) and each is
embedded in the 3-cube. Figure 5.1 shows this topology. Some of the transputer
work for this thesis was performed by a network of sixteen IMS T800-20 transputers
connected in nearly hypercube fashion (Figure 5.2). This is not identical to the
4-cube, so it will be called the hybrid cube (it is used as a root with two subtrees that
happen to be 3-cubes). The subtrees of the hybrid cube can be distinguished by the
first bit. One of the 3-cubes has labels like 0xxx; the other is labeled 1xxx.

Figure 5.2: Hybrid Hypercube Interconnection Topology

The rationale behind building the hybrid cube is purely practical. The
transputers have only four links. Assuming that we define each node of the hypercube to
be a single transputer, a pure hypercube of order four would be a closed
interconnection scheme with no opportunity for input or output to or from the system. Here,
the root node has been inserted between nodes zero (0000) and eight (1000). While
this deals a horrible blow to the elegance of hypercube algorithms (particularly
communications), it can be used effectively.

The hardware for the hybrid hypercube is configured with code by Mike
Esposito [Ref. 31]. This gives us an unlabeled version of the structure that
appears in Figure 5.2. To make use of this configuration, the nodes must be labeled
in a logical fashion. The Gray code (Appendix C) is a reasonable choice for labeling
the nodes. The actual labeling is accomplished by a Network Information File (NIF)
when the transputers are loaded by the Logical Systems C Network Loader,
LD-NET. A more detailed description of this process is contained in the file named
hyprcube.nif in Appendix F.

Networks of transputers use point-to-point communications across
bidirectional links. The links for this work operate at 20 megabits per second
(bidirectionally). That is, ten megabits per second is a peak unidirectional transmission
rate. Current transputer implementations employ a store-and-forward approach to
message passing (see Appendix B) for multi-hop transmissions.

b.    Intel  iPSC/2 

The iPSC/2 used for this research contained eight processors of the
"CX" type (80386/80387 combination). The host is an 80386-based IBM-compatible
personal computer running AT&T UNIX System V (version 3.2). The nodes run a
local subset of UNIX called NX. The host is capable of supporting many users at
once, but each node supports only a single user.

Users can request p nodes, where $p = 2^n$ for $n \in \{0, 1, 2, 3\}$. If another
user does not already have the requested portion of the cube, the request is granted.
As long as nodes remain, another user can access them. For instance, one user could
be working on two nodes and, at the same time, another user could access up to
four others. While the first two users still possessed these six nodes, a third user
could get one or both of the remaining two nodes.

Unlike the transputers, Intel uses a direct-connect circuit switching (see
Appendix B) approach to multi-hop communications. There is an overhead
associated with setting up the path for communication, but this cost is nearly the same
regardless of how many hops the message crosses. Once the circuit is established,
the message can proceed directly from the origin to the destination with negligible
interference from intermediate nodes.
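The difference between the two approaches can be illustrated with first-order cost models. These are textbook approximations rather than measurements from either machine, and every parameter (startup_s, setup_s, bytes_per_s) and function name is a hypothetical placeholder.

```c
#include <assert.h>

/* Store-and-forward: every hop pays the startup cost and resends the whole message. */
double store_and_forward_time(int hops, double startup_s, double bytes, double bytes_per_s)
{
    assert(hops >= 0 && bytes_per_s > 0.0);
    return hops * (startup_s + bytes / bytes_per_s);
}

/* Circuit switching: one path setup, then the message streams end to end. */
double circuit_switched_time(double setup_s, double bytes, double bytes_per_s)
{
    assert(bytes_per_s > 0.0);
    return setup_s + bytes / bytes_per_s;
}
```

In this simplified view the store-and-forward cost grows in proportion to the number of hops, while the circuit-switched cost does not depend on it.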

c.    Host  and  Root 

The notion of host is similar on both machines, but there is a slight
difference. The Intel hypercube is directly connected to the host. The transputer
network, however, uses a substantially different protocol from that of the typical personal
computer. Transputers employ point-to-point serial communications, using an
11-bit link protocol with byte-by-byte acknowledgment. The acknowledge is a two-bit
packet with dual meaning: the receiving transputer has begun to receive the byte,
and it has storage space for another.
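The framing overhead implied by these packet sizes can be estimated as follows. This is a rough sketch that assumes the wire is kept continuously busy and ignores all software overhead; the function names are hypothetical, and only the 11-bit and 2-bit packet sizes are taken from the text.

```c
#include <assert.h>

enum { DATA_PACKET_BITS = 11, ACK_PACKET_BITS = 2 };

/* Payload bytes per second in one direction when the return wire carries
   only acknowledgments. */
double one_way_payload_rate(double link_bits_per_s)
{
    assert(link_bits_per_s > 0.0);
    return link_bits_per_s / DATA_PACKET_BITS;
}

/* Payload bytes per second in each direction when both directions send data,
   so each wire also carries one acknowledge per byte received. */
double two_way_payload_rate(double link_bits_per_s)
{
    assert(link_bits_per_s > 0.0);
    return link_bits_per_s / (DATA_PACKET_BITS + ACK_PACKET_BITS);
}
```

Under these assumptions at most 8 of every 11 bits sent in one direction are payload, and the fraction drops to 8 of every 13 when data flows both ways.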

In the transputer case, host means the PC. We use the term root
transputer to identify the transputer within the host PC that acts something like a host
to the attached network of transputers. Figure 5.1 illustrates this configuration. An
IMS B004 extension board in the host PC holds a T414 root transputer. The B004
is plugged into the PC's bus and a parallel-serial converter lies between the PC and
the T414. In Figure 5.1 the "host" is a PC and the "root" transputer is the T414.
The iPSC/2 host is simplified, and could almost be thought of as a combination of
the host and root for the transputer case. Since the entire thesis uses the same
programs for both machines, the root and host terminology can become confusing. As it
is not always convenient to express this difference in painstaking detail, I will use the
terms somewhat loosely. An understanding of the differences between the machines
should serve to eliminate confusion in every case. When only one of the terms (host
or root) is needed, I have used the correct term. When both of the terms apply, I
have used them almost interchangeably and they should be interpreted according to
the machine under consideration.

2.    Software 

The  software  for  this  research  was  written  in  the  C  language.  The  Logical 
Systems  C  product  (version  89.1  of  15  January  1990)  was  used  for  the  transputer 
implementation.  For  the  iPSC/2  work,  the  C  compiler  supplied  by  Intel  was  used. 

B.   COMMUNICATIONS  FUNCTIONS 

Prior to implementing the Gauss algorithms, a substantial communications
package was constructed. Most of the code for communications appears in the files
comm.h and comm.c (see Appendix F). As expected, the header file provides
definitions for manifest constants and specifications (declarations) for the functions.
An overview of the functions provided in this file is useful before we discuss the
Gauss code that called these functions.

The cubecast() function supports broadcasts from the host to all the nodes
of a hypercube. Given a hypercube of order $n \in \{0, 1, 2, 3\}$ with $p = 2^n$ processors,
this communication is completed in n, or $\log_2(p)$, stages. This has some utility
in a 3-cube, but imagine the impact in a 10-cube. All 1,024 processors in the
hypercube would have the message after 10 stages of communication. This function
is especially useful at the beginning of a problem, when data must be shipped to
each of the workers in the network.
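The stage count can be illustrated with a small simulation. This is a sketch of one common broadcast schedule, not the cubecast() code of Appendix F; node 0 stands in for the point where the host's message enters the cube, and the name simulate_cubecast is hypothetical.

```c
#include <assert.h>

/*
 * Simulate a broadcast from node 0 on a hypercube of order n.
 * On return, stage[i] is the stage (1..n) in which node i received the
 * message; stage[0] is 0 because node 0 starts with it.
 * stage[] needs 2^n entries. Returns the number of stages used.
 */
int simulate_cubecast(int n, int *stage)
{
    assert(n >= 0 && n < 31 && stage != 0);
    int p = 1 << n;
    int stages = 0;

    stage[0] = 0;
    for (int d = 1; d < p; d <<= 1) {
        stages++;
        /* Nodes 0..d-1 already hold the message; each sends across direction d. */
        for (int node = 0; node < d; node++)
            stage[node ^ d] = stages;
    }
    return stages;
}
```

The number of nodes holding the message doubles at every stage, which is why p nodes are reached in $\log_2(p)$ stages.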

Often we need to gather information in the reverse direction, from the workers
back to the root. The coalesce() function is one way to accomplish this task. If no
modification was necessary at intermediate nodes, this operation could be completed
without interference. In the algorithms that I used, however, there was occasion to
modify the information along the way back to the root. For this reason, the gathering
is accomplished using two function calls. First, information is coalesced to a given
node. Upon return from coalesce(), the data exists locally and may be operated
upon. When the data is ready for submission, the submit() function is used to pass
it one step closer to the root.
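The gather-then-modify pattern can be sketched as a simulation. It is not the coalesce() or submit() code of Appendix F: addition stands in for whatever modification an intermediate node performs, node 0 stands in for the node nearest the root, and the name coalesce_sum is hypothetical.

```c
#include <assert.h>

/*
 * Simulate gathering toward node 0 on a hypercube of order n.
 * value[i] starts as node i's contribution; in each stage the upper half of
 * the remaining nodes submits to its partner, which combines (here: adds)
 * what it receives before passing the result on in a later stage.
 * value[] needs 2^n entries. Returns the combined value at node 0.
 */
long coalesce_sum(int n, long *value)
{
    assert(n >= 0 && n < 31 && value != 0);
    for (int d = (1 << n) >> 1; d >= 1; d >>= 1)
        for (int node = d; node < 2 * d; node++)
            value[node ^ d] += value[node];   /* node submits across direction d */
    return value[0];
}
```

Like the broadcast, this gather finishes in n stages on a hypercube of order n.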

A modification of the cubecast() function that was useful for the Gauss
problem was cubecast_from(). This function does not assume that the host is the
originator of the broadcast. Instead, the source is specified as the first argument to
this function. The function still performs the broadcast in $\log_2(p)$ stages, but it uses
the concept of a direction to accomplish this.
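A broadcast from an arbitrary source can be simulated by working with labels relative to the source. The sketch below is an assumption about how such a routine might be organized, not the cubecast_from() code of Appendix F; the name simulate_cubecast_from and its argument list are hypothetical.

```c
#include <assert.h>

/*
 * Simulate a broadcast from an arbitrary source on a hypercube of order n.
 * The relative label rel = node XOR source plays the role that the absolute
 * label plays when node 0 is the source. has_msg[] needs 2^n entries.
 * Returns the number of stages used.
 */
int simulate_cubecast_from(int source, int n, int *has_msg)
{
    assert(n >= 0 && n < 31 && has_msg != 0);
    int p = 1 << n;
    assert(source >= 0 && source < p);

    for (int i = 0; i < p; i++)
        has_msg[i] = 0;
    has_msg[source] = 1;

    int stages = 0;
    for (int d = 1; d < p; d <<= 1) {
        /* Nodes whose relative label is below d already hold the message. */
        for (int rel = 0; rel < d; rel++) {
            int sender = rel ^ source;
            has_msg[sender ^ d] = has_msg[sender];
        }
        stages++;
    }
    return stages;
}
```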

The concept of directions in the hypercube turns out to be a fairly useful
one. For concreteness, consider the 3-cube shown in Figure C.2. Starting at
any given node, we can specify a direction using one of the three combinations
$d \in \{001, 010, 100\}$. Suppose that the node's label is $\ell$ and let $\oplus$ denote the
exclusive OR operation. Then for some direction, $d$, the number $(\ell \oplus d)$ is the label of the
node in the direction $d$ from the node $\ell$.

This concept can be applied in general in a hypercube of order n using n-bit
labels for the nodes and some direction d. The possible directions are all the n
combinations of (n - 1) zeros and a single one in an n-bit number. Accordingly,
the code uses directions $d \in \{1, 2, 4, \ldots, 2^{n-1}\}$. In most cases, when a
direction-by-direction approach is desired for all possible directions, we start with one and use
the C left shift operator (<<) to produce the other directions incrementally.
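The direction loop can be sketched directly in C. The function name hypercube_neighbors is hypothetical; only the exclusive OR rule and the left-shift idiom come from the text.

```c
#include <assert.h>

/*
 * Write the labels of the n neighbors of `label` into out[0..n-1],
 * one per direction d = 1, 2, 4, ..., 2^(n-1). Returns n.
 */
int hypercube_neighbors(unsigned label, int n, unsigned *out)
{
    assert(n >= 0 && n < 32 && out != 0);
    int count = 0;
    for (unsigned d = 1; count < n; d <<= 1)
        out[count++] = label ^ d;   /* neighbor in direction d */
    return count;
}
```

For node 101 in a 3-cube, the three directions give neighbors 100, 111, and 001.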

These functions and several others are described in detail in the code of
Appendix F, but these basic ideas give us a reasonably good introduction at a level
that is adequate for understanding the algorithms.

C.   CODE  DESCRIPTIONS 

A  detailed  description  of  the  source  code  used  to  implement  the  algorithms  of 
Chapter  IV  is  given  in  the  header  file  gf.h.  This  header  file,  located  in  Appendix  F,  is 
used  by  both  the  partial  pivoting  and  complete  pivoting  codes.  The  code  for  GF  with 
partial  pivoting  can  be  found  in  gfpphost.c,  the  host  program,  and  gfppnode.c, 
the  node  program.  The  code  for  the  complete  pivoting  algorithm  is  similar  except 
for  the  election  of  pivots,  so  most  of  it  has  been  omitted  in  the  interest  of  saving 
space.  Only  the  elect_next_pivot()  function  remains  because  it  is  the  significant 
difference  between  the  partial  and  complete  pivoting  codes.  This  function  appears 
in  gfpcnode.c. 

VI.    RESULTS
A.   GAUSS  WITH  COMPLETE  PIVOTING 

The  host  code,  gfpchost.c,  and  the  node  program,  gfpcnode.c,  are  written 
to  provide  a  parallel  implementation  of  Gauss  Factorization  with  complete  pivoting. 
Since  the  columns  of  A  are  distributed  among  the  nodes  of  the  multiprocessor  system, 
the  selection  of  each  pivot  requires  communication.  The  selection  process,  in  this 
case,  begins  with  each  node  selecting  its  own  best  candidate  for  pivot.  Once  each 
of  the  nodes  has  made  this  choice,  an  election  is  held  to  select  the  best  candidate 
among  all  of  the  nodes. 

Implementation details for the election process are described in the source code,
so a detailed description is not given here. Nevertheless, these results show how
communication, like the election process, can work against efficient parallel
programming. This program shows how parallel performance can suffer from the effects of
communications. (Recall Fox's $t_{comm}/t_{calc}$ and Seitz's three components of overhead
from Chapter IV.)

The complete pivoting strategy inserts inefficient communications between each
stage of the process. The communications themselves are bound to be inefficient since
the election process finds all nodes of an n-cube participating in an n-stage exchange
of a 20-byte structure (pivot candidates). In addition to the use of small messages,
the election imposes an added measure of synchronization upon the problem. This
allows the processors less independence and forces them to transition between
"useful" program execution and communication more frequently. This transition can
become burdensome and the processor can eventually find little time to perform
calculations.
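An n-stage exchange of this kind can be sketched as a pairwise tournament across each direction of the cube. The simulation below is one common way to organize such an election and is not the elect_next_pivot() code from gfpcnode.c; the pivot_candidate structure, its field layout, the tie-breaking rule, and the name elect_pivot are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical pivot candidate; the field layout is an assumption. */
typedef struct {
    double magnitude;  /* absolute value of the candidate element */
    int row;
    int col;
    int node;          /* node that holds the candidate */
} pivot_candidate;

/* Nonzero if a should win over b; ties go to the lower node so all nodes agree. */
static int beats(const pivot_candidate *a, const pivot_candidate *b)
{
    if (a->magnitude != b->magnitude)
        return a->magnitude > b->magnitude;
    return a->node < b->node;
}

/*
 * Simulate the election on a hypercube of order n. cand[i] is node i's local
 * best; after n pairwise exchange stages every entry holds the global winner.
 * cand[] needs 2^n entries. Returns the number of stages.
 */
int elect_pivot(int n, pivot_candidate *cand)
{
    assert(n >= 0 && n < 31 && cand != 0);
    int p = 1 << n;
    int stages = 0;
    for (int d = 1; d < p; d <<= 1) {
        for (int node = 0; node < p; node++) {
            int partner = node ^ d;
            if (node < partner) {   /* handle each pair once */
                pivot_candidate winner =
                    beats(&cand[node], &cand[partner]) ? cand[node] : cand[partner];
                cand[node] = winner;
                cand[partner] = winner;
            }
        }
        stages++;
    }
    return stages;
}
```

Every node sends and receives one small structure in each of the n stages, and no node can move on until its partner has responded, which is the synchronization cost described above.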

In addition to the election process, there is a one-to-all broadcast from the
node holding the pivot to inform the others of the pivot column values. With an
$m \times m$ matrix A, this message is essentially a column of m double precision
floating-point values. Doubles for this implementation were eight bytes each, so this is a
unidirectional broadcast of 8m bytes with exponential fanout.

The  election  process — as  simple  as  it  appears — will  prove  to  be  an  obstacle 
that  opposes  efficiency.  Both  the  iPSC/2  and  transputer  systems  reward,  in  terms 
of  transmission  rates,  the  sender  of  long  messages.  Short  messages  are  essentially 
penalized  by  the  overhead  involved  in  setting  up  the  transmission  line  and  manager. 
Let  us  consider  the  results  of  this  complete  pivoting  strategy.  The  results  from  the 
iPSC/2  appear  first  followed  by  the  transputer  results.  The  largest  dimension,  n, 
that  is  recorded  is  n  =  176.  The  iPSC/2  machine  would  handle  larger  problems,  but 
this  seemed  pointless  since  the  performance  appears  to  approach  maximum  efficiency 
early. 

1.    Data  for  the  iPSC/2  System 

Table  6.1  shows  the  timing  data  for  execution  of  Gauss  Factorization  with 
complete  pivoting  on  the  Intel  iPSC/2  system. 

TABLE 6.1: EXECUTION TIMES FOR GF(PC) ON THE iPSC/2

Dimension    Time (seconds) on a Hypercube of Order
      (n)         0         1         2         3
        8     0.126     0.097     0.092     0.155
       16     0.716     0.674     0.608     0.744
       24     2.208     1.751     1.616     1.568
       32     4.627     3.705     3.239     3.149
       40     9.246     6.888     5.895     5.250
       48    14.888    11.479     9.770     9.109
       56    23.686    17.883    15.206    13.796
       64    36.123    26.424    22.326    19.957
       72    49.227    38.178    31.421    28.460
       80    70.546    50.754    42.087    37.810
       88    89.210    69.257    56.803    51.148
       96   115.473    86.760    72.346    63.954
      104   150.915   110.247    91.966    82.680
      112   182.475   138.880   114.486   102.266
      120   224.458   168.056   139.587   123.683
      128   282.491   206.222   170.650   153.379
      136   339.076   248.422   208.745   186.205
      144   385.623   295.217   241.564   217.099
      152   468.763   345.049   281.972   254.538
      160   527.953   404.235   331.653   292.352
      168   636.004   457.089   381.597   338.464
      176   723.596   532.597   449.745   395.008

TABLE 6.2: SPEEDUPS FOR GF(PC) ON THE iPSC/2

Dimension    Speedup on a Hypercube of Order
      (n)         1         2         3
        8     1.299     1.373     0.813
       16     1.063     1.178     0.962
       24     1.261     1.367     1.408
       32     1.249     1.429     1.470
       40     1.342     1.569     1.761
       48     1.297     1.524     1.635
       56     1.324     1.558     1.717
       64     1.367     1.618     1.810
       72     1.289     1.567     1.730
       80     1.390     1.676     1.866
       88     1.288     1.571     1.744
       96     1.331     1.596     1.806
      104     1.369     1.641     1.825
      112     1.314     1.594     1.784
      120     1.336     1.608     1.815
      128     1.370     1.655     1.842
      136     1.365     1.624     1.821
      144     1.306     1.596     1.776
      152     1.359     1.662     1.842
      160     1.306     1.592     1.806
      168     1.391     1.667     1.879
      176     1.359     1.609     1.832

The speedup data shown in Table 6.2 are derived from these execution times.
Speedup was calculated using the usual formula (see Appendix A for details)

$$S_p = \frac{T_1}{T_p}$$

for speedup on p processors, where $T_p$ is the execution time on p processors.

TABLE 6.3: EFFICIENCIES FOR GF(PC) ON THE iPSC/2

Dimension    Efficiency (percent) on a Hypercube of Order
      (n)         1         2         3
        8    64.948    34.332    10.161
       16    53.155    29.441    12.024
       24    63.068    34.169    17.603
       32    62.451    35.716    18.370
       40    67.122    39.215    22.015
       48    64.852    38.098    20.431
       56    66.225    38.943    21.462
       64    68.354    40.450    22.625
       72    64.470    39.168    21.621
       80    69.498    41.905    23.323
       88    64.405    39.263    21.802
       96    66.548    39.903    22.570
      104    68.444    41.025    22.816
      112    65.695    39.847    22.304
      120    66.781    40.200    22.685
      128    68.492    41.385    23.022
      136    68.246    40.609    22.762
      144    65.312    39.909    22.203
      152    67.927    41.561    23.020
      160    65.303    39.797    22.574
      168    69.571    41.667    23.489
      176    67.931    40.223    22.898

Given the execution times and speedups presented in Tables 6.1 and 6.2, and using
the formula

$$E_p = \frac{S_p}{p}$$

(as defined in Appendix A), we can determine the efficiency of p processors applied
to the Gauss problem. This efficiency data is shown in Table 6.3.
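As a worked check, the two quantities can be computed from a pair of entries in Table 6.1, taking speedup as the one-processor time divided by the p-processor time and efficiency as speedup divided by p. The function names below are hypothetical; the timings are the n = 176 entries for orders 0 and 3 (one and eight processors).

```c
#include <assert.h>

/* Speedup on p processors: one-processor time divided by p-processor time. */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Efficiency as a percentage: speedup divided by the processor count. */
double efficiency_percent(double t1, double tp, int p)
{
    assert(p > 0);
    return 100.0 * speedup(t1, tp) / p;
}
```

With 723.596 seconds on one processor and 395.008 seconds on eight, speedup(723.596, 395.008) is about 1.832 and efficiency_percent(723.596, 395.008, 8) is about 22.898, in agreement with the n = 176 rows of Tables 6.2 and 6.3.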

Figure 6.1: Efficiencies for GF (PC) on the iPSC/2

Many different graphical displays of this data would be interesting, but the efficiency
data may be the most interesting since it captures the success or failure of a
parallel program (i.e., poor efficiencies should lead us to question the parallel nature
of the algorithm). Figure 6.1 shows a scatterplot of the data from Table 6.3.

TABLE 6.4: EXECUTION TIMES FOR GF(PC) ON THE TRANSPUTERS

Dimension    Time (seconds) on a Hypercube of Order
      (n)         0         1         2         3         4
        8    0.0083    0.0075    0.0077    0.0088    0.0925
       16    0.0481    0.0392    0.0373    0.0372    0.1236
       24    0.1494    0.1173    0.1063    0.1001    0.1855
       32    0.3417    0.2580    0.2220    0.2132    0.2947
       40    0.6538    0.4922    0.4135    0.3798    0.4587
       48    1.1158    0.8202    0.6934    0.6397    0.7041
       56              1.2950    1.0716    0.9696    1.0239
       64              1.8940    1.5688    1.4046    1.4407
       72                        2.2116    1.9817    1.9808
       80                        2.9560    2.6529    2.6248
       88                        3.9127    3.4812    3.4090
       96                                  4.4808    4.3812
      104                                  5.6442    5.4519
      112                                  7.0388    6.7087
      120                                  8.5430    8.1252
      128                                 10.3300    9.7532
      136                                           11.6930
      144                                           13.6538
      152                                           16.1029
      160                                           18.5476
      168                                           21.4437
      176                                           24.4684
      max        48        67        92       128       176

2.    Data  for  the  Transputer  System 

Using the same methods, the timing (Table 6.4), speedup (Table 6.5), and
efficiency (Table 6.6) data for the transputer system is determined. Unfortunately,
the memory limitations of the transputers used for this work prevented comparisons
for large problem sizes. Empty portions of Table 6.4 signify unavailability of data (i.e.,
execution failure due to inappropriate or excessive problem size). The maximum
problem size that executed successfully for each configuration is listed on the last
line of the table. Figure 6.2 shows a scatterplot of the data from Table 6.6.

TABLE 6.5: SPEEDUPS FOR GF(PC) ON THE TRANSPUTERS

Dimension    Speedup on a Hypercube of Order
      (n)         1         2         3         4
        8     1.111     1.074     0.942     0.090
       16     1.227     1.288     1.290     0.389
       24     1.274     1.405     1.493     0.805
       32     1.324     1.539     1.602     1.159
       40     1.328     1.581     1.721     1.425
       48     1.360     1.609     1.744     1.585
       56     1.363     1.648     1.821     1.724
       64     1.389     1.677     1.872     1.826
       72               1.691     1.887     1.888
       80               1.734     1.932     1.953
       88               1.743     1.959     2.001
       96                         1.975     2.020
      104                         1.993     2.064
      112                         1.996     2.094
      120                         2.022     2.126
      128                         2.030     2.150
      136                                   2.150
      144                                   2.186
      152                                   2.180
      160                                   2.207
      168                                   2.210
      176                                   2.227


TABLE 6.6: EFFICIENCIES FOR GF(PC) ON THE TRANSPUTERS

Dimension    Efficiency (percent) on a Hypercube of Order
      (n)         1         2         3         4
        8    55.556    26.860    11.775     1.125
       16    61.356    32.204    16.130     2.431
       24    63.693    35.133    18.662     5.034
       32    66.224    38.477    20.029     7.246
       40    66.409    39.526    21.514     8.908
       48    68.017    40.230    21.803     9.905
       56    68.167    41.190    22.760    10.776
       64    69.431    41.913    23.406    11.410
       72              42.279    23.592    11.801
       80              43.358    24.155    12.207
       88              43.575    24.488    12.504
       96                        24.691    12.626
      104                        24.916    12.897
      112                        24.948    13.088
      120                        25.279    13.289
      128                        25.369    13.435
      136                                  13.440
      144                                  13.662
      152                                  13.623
      160                                  13.795
      168                                  13.812
      176                                  13.917

Figure 6.2: Efficiencies for GF (PC) on Transputers

B.   GAUSS WITH PARTIAL PIVOTING
1.    Data  for  the  iPSC/2  System 

Table 6.7 shows the timing data for execution of the Gauss Factorization
(partial pivoting) codes (gfpphost.c and gfppnode.c) on the Intel iPSC/2 system.
The speedup data shown in Table 6.8 are derived from these execution times.
Speedup was calculated using the usual formula (see Appendix A for details)

$$S_p = \frac{T_1}{T_p}$$

for speedup on p processors. Given the execution times and speedups presented in
Tables 6.7 and 6.8, and using the formula

$$E_p = \frac{S_p}{p}$$

(as defined in Appendix A), we can determine the effectiveness (efficiency) of p
processors applied to the Gauss problem. This efficiency data is shown in Table 6.9.

TABLE 6.7: EXECUTION TIMES FOR GF(PP) ON THE iPSC/2

Dimension    Time (seconds) on a Hypercube of Order
      (n)         0         1         2         3
        8     0.109     0.130     0.127     0.155
       16     0.371     0.359     0.394     0.493
       24     0.508     0.489     0.519     0.624
       32     0.752     0.673     0.675     0.782
       40     1.055     0.880     0.834     0.911
       48     1.499     1.144     1.024     1.067
       56     2.019     1.473     1.248     1.228
       64     2.733     1.878     1.491     1.402
       72     3.646     2.412     1.872     1.721
       80     4.743     3.040     2.256     1.989
       88     6.053     3.719     2.644     2.237
       96     7.567     4.547     3.125     2.560
      104     9.431     5.477     3.698     2.912
      112    11.468     6.561     4.252     3.237
      120    13.847     7.859     4.933     3.646
      128    16.552     9.211     5.661     4.070
      136    19.619    10.873     6.590     4.633
      144    23.071    12.632     7.532     5.170
      152    26.982    14.681     8.940     5.866
      160    31.204    16.869     9.866     6.539
      168    35.865    19.318    11.143     7.284
      176    41.064    21.990    12.605     8.084
      200    59.453    31.437    17.598    10.910
      225    83.962    44.076    24.329    14.701
      250   114.319    59.515    32.410    19.118
      275   151.443    78.652    42.336    24.512
      300   195.822   102.589    54.138    30.927
      325   248.153   127.840    68.082    38.418
      350   309.241   158.859    84.072    46.978
      375   379.538   194.599   101.984    56.280
      400   459.740   235.259   122.946    67.366
      425   550.536   281.312   147.058    80.439
      450   653.070   333.180   173.748    94.656
      475   767.616   391.136   203.513   110.243
      500   894.705   455.308   236.483   127.631

TABLE 6.8: SPEEDUPS FOR GF(PP) ON THE iPSC/2

Dimension    Speedup on a Hypercube of Order
      (n)         1         2         3
        8     0.842     0.860     0.704
       16     1.035     0.941     0.753
       24     1.039     0.979     0.814
       32     1.118     1.114     0.961
       40     1.199     1.265     1.158
       48     1.311     1.465     1.405
       56     1.371     1.618     1.645
       64     1.455     1.833     1.949
       72     1.512     1.948     2.119
       80     1.560     2.102     2.384
       88     1.628     2.289     2.706
       96     1.664     2.422     2.956
      104     1.722     2.550     3.239
      112     1.748     2.697     3.543
      120     1.762     2.807     3.798
      128     1.797     2.924     4.067
      136     1.804     2.977     4.235
      144     1.826     3.063     4.462
      152     1.838     3.018     4.600
      160     1.850     3.163     4.772
      168     1.857     3.219     4.924
      176     1.867     3.258     5.080
      200     1.891     3.378     5.449
      225     1.905     3.451     5.711
      250     1.921     3.527     5.980
      275     1.925     3.577     6.178
      300     1.909     3.617     6.332
      325     1.941     3.645     6.459
      350     1.947     3.678     6.583
      375     1.950     3.722     6.744
      400     1.954     3.739     6.825
      425     1.957     3.744     6.844
      450     1.960     3.759     6.899
      475     1.963     3.772     6.963
      500     1.965     3.783     7.010

TABLE 6.9: EFFICIENCIES FOR GF(PP) ON THE iPSC/2

Dimension    Efficiency (percent) on a Hypercube of Order
      (n)         1         2         3
        8    42.085    21.499     8.803
       16    51.743    23.526     9.416
       24    51.943    24.470    10.174
       32    55.911    27.842    12.019
       40    59.943    31.615    14.472
       48    65.544    36.615    17.563
       56    68.557    40.453
       72
Figure 6.3: Efficiencies for GF(PP) on the iPSC/2

Here,  again,  only  the  efficiency  is  plotted.  Figure  6.3  shows  a  scatterplot  of  the  data 
from  Table  6.9. 
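
The speedup and efficiency figures in these tables are consistent with the usual definitions: speedup is the single-node time divided by the time on the hypercube, and efficiency is the speedup divided by the number of nodes (2 raised to the order of the cube). The following Python sketch, which is not part of the thesis code, checks this against the n = 176 timing row; the function names are illustrative, and the definitions are assumed to match those given in Chapter VI.

```python
def speedup(t_serial, t_parallel):
    """Speedup: time on one node divided by time on the hypercube."""
    return t_serial / t_parallel


def efficiency_percent(t_serial, t_parallel, order):
    """Efficiency in percent: speedup divided by the 2**order nodes used."""
    return 100.0 * speedup(t_serial, t_parallel) / (2 ** order)


# iPSC/2 execution times (seconds) for n = 176 on hypercubes of order 0 to 3.
times_n176 = {0: 41.064, 1: 21.990, 2: 12.605, 3: 8.084}

for order in (1, 2, 3):
    s = speedup(times_n176[0], times_n176[order])
    e = efficiency_percent(times_n176[0], times_n176[order], order)
    print(f"order {order}: speedup {s:.3f}, efficiency {e:.3f} percent")
```

The printed speedups agree with the n = 176 row of Table 6.8 to three decimal places.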
2.    Data for the Transputer System

Using the same methods, the timing (Table 6.10), speedup (Table 6.11), and
efficiency (Table 6.12) data for the transputer system are determined. Unfortunately,
the memory limitations of the transputers (32 kilobytes per node) used for this
work prevented comparisons for large (interesting) problem sizes. Empty portions of
Table 6.10 signify unavailability of data (i.e., execution failure due to inappropriate
or excessive problem size). The maximum problem size that executed successfully
for each configuration is listed on the last line of Table 6.10. The minimum problem
size for the hybrid cube on 16 processors was one where the dimension of A was
n = 16.




TABLE 6.10: EXECUTION TIMES FOR GF(PP) ON THE TRANSPUTERS

Dimension    Time (seconds) on a Hypercube of Order
   (n)           0          1          2          3          4
     8        0.0906     0.0904     0.0906     0.0909
    16        0.1126     0.1101     0.1102     0.1107     0.1092
    24        0.1582     0.1480     0.1462     0.1461     0.1439
    32        0.2312     0.2038     0.1965     0.1952     0.1889
    40        0.3360     0.2765     0.2568     0.2520     0.2446
    48                   0.3782     0.3402     0.3297     0.3149
    56                   0.5124     0.4463     0.4258     0.4064
    64                   0.6911     0.5863     0.5505     0.5196
    72                              0.7277     0.6715     0.6308
    80                              0.8976     0.8147     0.7560
    88                              1.0675     0.9482     0.8732
    96                                         1.1584     1.0581
   104                                         1.3657     1.2430
   112                                         1.6129     1.4551
   120                                         1.8388     1.6490
   128                                                    1.8585
   136                                                    2.1306
   144                                                    2.3606
   152                                                    2.6717
   160                                                    2.9846
   168                                                    3.2910
   176                                                    3.6606
 n_max           47         66         92        127        176
TABLE 6.11: SPEEDUPS FOR GF(PP) ON THE TRANSPUTERS

Dimension    Speedup on a Hypercube of Order
                 1          2          3          4
     8         1.002      1.000      0.997
    16         1.023      1.022      1.017      1.031
    24         1.069      1.082      1.083      1.099
    32         1.134      1.177      1.184      1.224
    40         1.215      1.308      1.333      1.374
    48         1.302      1.447      1.493      1.563
    56         1.387      1.592      1.669      1.748
    64         1.448      1.707      1.818      1.926
    72                    1.888      2.046      2.178
    80                    2.049      2.258      2.433
    88                    2.256      2.539      2.758
    96                               2.667      2.920
   104                               2.853      3.134
   112                               2.998      3.323
   120                               3.219      3.590
   128                                          3.852
   136                                          4.019
   144                                          4.296
   152                                          4.456
   160                                          4.646
   168                                          4.871
   176                                          5.031
TABLE 6.12: EFFICIENCIES FOR GF(PP) ON THE TRANSPUTERS

Dimension    Efficiency (percent) on a Hypercube of Order
   (n)           1          2          3          4
     8        50.111     25.000     12.459
    16        51.135     25.544     12.715      6.445
    24        53.446     27.052     13.535      6.871
    32        56.722     29.415     14.805      7.650
    40        60.759     32.710     16.667      8.585
    48        65.090     36.180     18.666      9.772
    56        69.334     39.801     20.859     10.927
    64        72.412     42.678     22.727     12.039
    72                   47.193     25.571     13.611
    80                   51.228     28.220     15.206
    88                   56.392     31.744     17.235
    96                              33.343     18.252
   104                              35.657     19.589
   112                              37.475     20.770
   120                              40.241     22.436
   128                                         24.073
   136                                         25.116
   144                                         26.849
   152                                         27.850
   160                                         29.036
   168                                         30.447
   176                                         31.441

[Figure 6.4: scatterplot of efficiency (percent, axis from 0 to 80) against the
dimension, n, of the matrix A (axis from 20 to 180). Plot symbols: * Order 1,
o Order 2, + Order 3, x Order 4.]

Figure 6.4: Efficiencies for GF(PP) on Transputers


Figure  6.4  shows  a  scatterplot  of  the  data  from  Table  6.12. 
VII.    CONCLUSIONS

I value the discovery of a single even insignificant truth more highly than all
the  argumentation  on  the  highest  questions  which  fails  to  reach  a  truth. 

—  GALILEO  (1564-1642) 

A.   SIGNIFICANCE  OF  THE  RESULTS 
1.    Communications  and  Computation 

Perhaps one of the most obvious effects in the results of Chapter VI is the
abysmal performance of the complete pivoting code when compared to the partial
pivoting implementation. The relatively small amount of extra communication
required for the complete pivoting algorithm seems to force synchronization
delays, thus reducing the system's performance. This demonstrates the
criticality of balancing communications with calculation in parallel processing. The
conclusion, for this problem, is that parallel designs must minimize the frequency of
synchronizing events and minimize the communications volume on occasions when
communication is necessary. The greater the amount of uninterrupted work that a
processor can accomplish, the better. While control (i.e., blocking communications,
synchronization, and loop-by-loop data distribution) is necessary, it will have adverse
impacts on performance. The individual processors of a multiprocessor system should
be granted the maximum degree of independence that the mission will allow.

While there is undoubtedly some room for improvement in the complete
pivoting code, it would appear that maximum efficiencies of approximately 22%,
40%, and 70% for hypercubes of order three, two, and one, respectively, are likely on
the iPSC/2. The same code seems to be headed for somewhat better performance
on the transputers, but with the shortage of memory, it is difficult to extrapolate
and determine the direction of the plots. The higher order cubes appear to flatten
at about the same efficiency that the iPSC/2 showed as a terminal efficiency.

The partial pivoting code, on the other hand, exhibits the kind of
characteristics that we like to see in parallel code. Both systems show efficiencies rising
sharply (again, the size limit for the transputers is unfortunate) and the iPSC/2
shows some very nice results as the dimension of the matrix exceeds about 250.

B.   THE  TERAFLOP  RACE 

One of the biggest challenges to parallel computing today can be found in the
"teraflop race." There are at least three competitors with teraflop initiatives: the
United States, Europe, and Japan. The United States effort centers around Intel
with projects like Touchstone (Chapter I). The European effort relies on the T9000
transputer. Considering the three- to five-year-old technology used for this research,
together with the numbers that the various parallel computer designers boast today,
it seems that we might see teraflop performance by the mid-1990s. C. Gordon Bell
claims that the teraflop is conceivable [Ref. 6: p. 1099]:

Two relatively simple and sure paths exist for building a system that could
deliver on the order of 1 teraflop by 1995. They are: (1) A 4K node multicomputer
with 800 gigaflops peak or a 32K node multicomputer with 1.5 teraflops. (2) A
Connection Machine with more than one teraflop and several million processing
elements.

Current products suggest that INMOS and Intel will be among the most likely
competitors. Table 7.1, adapted from Jack Dongarra's report [Ref. 8: p. 20], shows
how transputer-based systems compare to Intel products. This table summarizes a
test involving the solution of a 1000 x 1000 system of linear equations. The
processors used for my thesis show floating-point capabilities of 0.37 Mflops (T800-20) and
0.16 Mflops (Compaq 386/20 with 80387) in Dongarra's report [Ref. 8: pp. 14, 16].
TABLE 7.1: PARALLEL MACHINE COMPARISON

Computer              t_1       p       t_p     Speedup    Efficiency
Parsytec FT-400      1075     400      4.90      219.0        .55
Parsytec FT-400      1075     256      6.59      163.0        .64
Parsytec FT-400      1075     100     13.20       81.4        .81
Parsytec FT-400      1075      64     19.10       56.3        .88
Parsytec FT-400      1075      16     69.20       15.5        .97
Intel iPSC/860         59      32      5.30       11.0        .34
Intel iPSC/860         59      16      6.80        8.7        .54
Intel iPSC/860         59       8     10.60        5.6        .70

The iPSC/860 illustrates the most recent technology and shows excellent
uniprocessor performance (6.5 Mflops) [Ref. 8: p. 9]. The T800 transputer that Parsytec
used is somewhat dated and will soon be replaced by the T9000. Nevertheless, the
transputer-based system shows good parallel performance. The execution times in
the experiments of this thesis also indicate that the T800 is faster for floating-point
calculations than the 386/387 combination in the iPSC/2.

C.   FURTHER  WORK 

My  research  suggests  many  areas  for  further  investigation.  The  method  of 
conjugate  gradients  shows  a  great  deal  of  promise  as  a  candidate  for  parallelization. 
Indeed,  it  was  the  original  aim  of  this  thesis,  but  the  development  of  other  portions  of 
the  code  required  a  great  deal  of  time.  The  parallel  CG  algorithm  should  be  relatively 
simple  to  code  and  holds  great  potential  with  respect  to  performance.  Additionally, 
it  possesses  a  nontrivial  derivation  and  the  theory  behind  the  algorithm  would  be 
interesting  to  develop. 

There are many other variations on Gauss factorization that could be coded
and tested. While the programs presented in this thesis are designed in an effort
to produce efficient performance, there is undoubtedly much that might be done to
enhance this code. Among the options: at a very basic level, we could begin with
other distributions of the matrix A. A block method or row method may actually
yield better performance. As the LINPACK benchmarks seem to use blocks, this is
probably worth pursuing.

General purpose parallel computing, that is, the ability to rely on parallel
architectures for general purpose computation without the investigator having to be
more concerned with the architecture than with the problem being computed, still
requires much work. The ability to use parallel architectures as a computational tool
to solve problems will mark an increasing maturity in this field.

Applying  object-oriented  design  and  programming  paradigms  to  the  parallel 
world  may  hold  a  great  deal  of  promise.  In  particular,  the  C++  language  seems  to 
be  a  prudent  choice  for  parallel  programming. 

In addition to the more practical options, the study of parallel theory and
algorithms seems interesting and shows a great need for development. In particular,
this field seems to need a more-or-less general (at least for MIMD machines)
approach to classifying parallel algorithms and specifying their performance. As noted
in Chapter IV, a mixture of this field with graph theory may hold a great deal of
promise.

At first glance, the Ada programming language, with its built-in
tasking constructs, might seem optimal for the type of computing investigated in
this thesis. In this regard, however, Ada is optimized for use with shared memory
multiprocessors. The use of Ada on transputers still requires much experimentation
and better tools. Presently only one, rather expensive, Ada compiler is available for
transputer use. Its required use of occam harnesses makes using Ada on transputers
awkward at best. Further research is needed to create a better environment for Ada
programming on transputers. Given the significance of Ada to the DoD
establishment, this should become a priority. The inclusion of a standard math package and
the advent of Ada 9X may hold some promise in this regard.
APPENDIX A
NOTATION AND TERMINOLOGY


This appendix explains the shorthand used in the rest of the thesis.
Conventions, by definition, are generally accepted rules of the business. This would
seem to obviate the need for further discussion of conventions, but there are
several good reasons for discussing notation and terminology. First, the notation may
not be conventional. In the absence of convention (or when the foundation that it
provides is inadequate) a more substantial agreement is required. Second, even for
conventional notation, the audience may be diverse enough to warrant
familiarization. The following discussion provides this familiarity and gives the terms of an
agreement to establish the meaning of the words and symbols used in the rest of
the work. On occasion, neither convention nor this agreement will suffice. These
situations will be handled case-by-case with the philosophy that clarity should
never be sacrificed for brevity.

A.   BASICS 


Most of the work deals with the integers, $\mathbb{Z}$ (from the German word for numbers,
Zahlen), the set of real numbers, $\mathbb{R}$, and the complex numbers, $\mathbb{C}$. Often, the
German $\Re$ is used to represent the reals. A complex number is a number,
$z = x + iy \in \mathbb{C}$, that has a real part ($x \in \Re$) and an imaginary part ($y \in \Re$), with
the complex unit $i = \sqrt{-1}$. Sometimes the real part is denoted $\mathrm{Re}(z)$ and $\mathrm{Im}(z)$ is
used to represent the imaginary part.

A scalar is simply a real number, and is usually denoted by a lower-case Greek
letter.1 A vector is an ordered set of scalars. Lower-case Latin letters like b, x, and
y are used to denote vectors. Sometimes an arrow is placed above the name of a
vector (like $\vec{x}$) to emphasize the fact that it is a vector.


1 The Greek alphabet is shown in the Table of Symbols.
Matrices are two dimensional and usually contain real or complex elements.
Capital letters (Greek or Latin) are used to represent matrices. Common examples
include $A$, $P$, $Q$, $R$, $\Lambda$, and $\Sigma$.

The  number  systems  introduced  above  cannot  be  represented  in  a  finite  space. 
There  are  two  basic  problems.  First,  we  should  consider  the  size  (or  cardinality)  of 
the  sets.  The  integers  are  countable  or  denumerable  since  there  exists  a  one-to-one 
mapping  between  Z  and  the  natural  numbers,  N.  This  is  an  advantage  in  finite 
storage  since  it  means  that  we  can  choose  a  finite  range  of  the  integers  and  be  quite 
certain  that  every  integer  in  that  range  is  represented  (exactly).  Even  though  Z  is 
denumerable,  it  is  a  set  with  infinite  cardinality. 

The real numbers present a more difficult situation for finite storage. The real
number line is dense in comparison to the integers. $\Re$ is not only an infinite set, it is
not countable (i.e., $\Re$ is uncountable). It is said to have the power of the continuum.
To represent a real number, x, we use the floating-point approximation, fl(x), to x.
This is a number that may be described by three parts: the sign s, the exponent e,
and the mantissa d. An illustration of such a number is provided in Chapter II.
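
As a small illustration of the three-part description, the following Python sketch splits a double-precision value into a sign, a binary exponent, and a mantissa, and shows that fl(0.1) differs slightly from the real number 1/10. It is not taken from the thesis; the function name `decompose` and the normalization of the mantissa to the interval [0.5, 1) are assumptions of this sketch and may differ from the format described in Chapter II.

```python
import math
from fractions import Fraction


def decompose(x):
    """Return (sign, exponent, mantissa) with x == sign * mantissa * 2**exponent.

    For nonzero x the mantissa lies in [0.5, 1), the convention of math.frexp.
    """
    sign = -1 if math.copysign(1.0, x) < 0 else 1
    mantissa, exponent = math.frexp(abs(x))
    return sign, exponent, mantissa


s, e, d = decompose(-6.5)
print(s, e, d)  # -1 3 0.8125, since -6.5 == -1 * 0.8125 * 2**3

# fl(0.1) is the nearest representable double, which is not exactly 1/10.
error = Fraction(0.1) - Fraction(1, 10)
print(float(error))
```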

B.   COMPLEX  NUMBERS 
1.    Notation 

The  previous  section  introduced  one  notation  for  complex  numbers;  namely, 
z  =  x  +  iy.  There  are  several  other  representations,  each  of  which  makes  its  own 
contribution  in  practical  use.  Electrical  engineers  usually  replace  the  i  with  j  since  i 
is  used  to  represent  electrical  current.  Since  the  complex  number  can  be  represented 
by an ordered pair of real numbers, the graphical notation of Figure A.1 is natural.
In  this  plane,  the  real  and  imaginary  axes  are  used  to  represent  the  components  of 
a  complex  number. 
Figure A.1: The Complex Plane

The vector sum of these two parts, $\vec{z} = \vec{x} + \vec{y}$, is an equivalent and useful
way to model complex numbers. There is yet another way to describe z. Let r be the
magnitude of the vector z and let $\theta$ be the angle measured from the positive real axis
counter-clockwise to z. Using this notation, we could use trigonometry to describe
the complex number as $z = r(\cos\theta + i\sin\theta)$. The Euler formula [Ref. 32: p. 74],

\[ e^{z} = e^{x+iy} = e^{x} e^{iy} = e^{x}(\cos y + i \sin y), \tag{A.1} \]

can be used to convert a complex number to yet another form: $z = re^{i\theta}$.
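
The equivalence of the rectangular, trigonometric, and exponential forms can be checked numerically. The sketch below uses Python's standard `cmath` module and an arbitrarily chosen sample value of z; it is an illustration only and is not part of the thesis code.

```python
import cmath
import math

z = complex(1.0, 4.0)             # z = x + iy with x = 1.0 and y = 4.0

r, theta = cmath.polar(z)         # magnitude r and angle theta in radians
trig_form = r * (math.cos(theta) + 1j * math.sin(theta))
exp_form = r * cmath.exp(1j * theta)

print(r, theta)
print(trig_form, exp_form)        # both agree with z up to rounding error
```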
2.    Operations

a.    Addition and Subtraction

Addition and subtraction of complex numbers are performed in the same
manner that vectors are added or subtracted. For instance, let $z_1 = a + ib$ and let
$z_2 = c - id$. Then the sum, $z_1 + z_2$, is the same as the sum of the corresponding
vectors:

\[ z_1 + z_2 = \begin{bmatrix} a \\ b \end{bmatrix} + \begin{bmatrix} c \\ -d \end{bmatrix} = \begin{bmatrix} a + c \\ b - d \end{bmatrix} \tag{A.2} \]

so the sum is $z_1 + z_2 = (a + c) + i(b - d)$. Differences are handled in the obvious way,
as vector differences.


b.    Multiplication

Multiplication is performed by applying high school algebra. For the
same complex numbers $z_1$ and $z_2$:

\[ z_1 \times z_2 = (a + ib)(c - id) = ac - (a)(id) + (ib)(c) - (ib)(id) \tag{A.3} \]

and using the definition of the complex unit, $i = \sqrt{-1}$, we may combine the middle
terms and move the $i^2 = -1$ outside the last term to find the (complex) product:

\[ z_1 \times z_2 = ac - i(ad - bc) + bd = (ac + bd) - i(ad - bc) \tag{A.4} \]


c.  Conjugation

The complex conjugate of a complex number $z = x + iy$ is defined as
$\bar{z} = x - iy$. This simple operation finds practical application in complex division.

d.  Division

Consider the quotient ($z_1 / z_2$) of the same complex numbers that were
used in equations A.2, A.3, and A.4. If we multiply both the numerator and the
denominator by the complex conjugate of the denominator, $\bar{z}_2$, we have:

\[ \frac{z_1}{z_2} = \frac{a + ib}{c - id} = \frac{(a + ib)(c + id)}{(c - id)(c + id)} = \frac{ac + i(ad) + i(bc) + i^2(bd)}{c^2 - i^2 d^2} \tag{A.5} \]

and then, by applying $i^2 = -1$, we conclude:

\[ \frac{z_1}{z_2} = \frac{ac - bd + i(bc + ad)}{c^2 + d^2} = \frac{(ac - bd)}{(c^2 + d^2)} + i\,\frac{(bc + ad)}{(c^2 + d^2)} \tag{A.6} \]

As a practical matter, this is not the way we would compute a complex quotient.
The code given in Appendix F (function cdiv() in complex.h) provides a method
that is better suited to the finite precision environment.
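
To illustrate why the direct formula can fail in finite precision, the following Python sketch compares it with a scaled variant often attributed to R. L. Smith, which avoids forming the sum of squares in the denominator. This is only an illustration: whether cdiv() in Appendix F uses this particular scaling is not stated here, and both function names below are hypothetical. Each function divides (a + ib) by (c + id), so the denominator c - id of the text corresponds to passing -d.

```python
import math


def cdiv_textbook(a, b, c, d):
    """(a + ib) / (c + id) by multiplying through by the conjugate."""
    denom = c * c + d * d
    return (a * c + b * d) / denom, (b * c - a * d) / denom


def cdiv_scaled(a, b, c, d):
    """(a + ib) / (c + id) with scaling, so c*c + d*d is never formed."""
    if abs(c) >= abs(d):
        ratio = d / c
        denom = c + d * ratio
        return (a + b * ratio) / denom, (b - a * ratio) / denom
    ratio = c / d
    denom = c * ratio + d
    return (a * ratio + b) / denom, (b * ratio - a) / denom


# (1 + i4) / (2 - i5): the two methods agree for moderate values.
print(cdiv_textbook(1.0, 4.0, 2.0, -5.0))
print(cdiv_scaled(1.0, 4.0, 2.0, -5.0))

# For very large components the squares overflow in the textbook formula.
big = 1e200
print(cdiv_textbook(big, big, big, big))   # overflow leads to nan values
print(cdiv_scaled(big, big, big, big))     # the exact quotient is 1 + i0
```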

C.  VECTORS AND MATRICES
1.    Columns and Rows

Vectors are ordered collections of scalars represented as columns. Let
$\alpha, \beta, \gamma \in \mathbb{C}$ with $\alpha = 1.0 + i4.0$, $\beta = 2.0 - i5.0$, and $\gamma = 3.0 + i6.0$. Then:

\[ x = \begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 \\ 2.0 - i5.0 \\ 3.0 + i6.0 \end{bmatrix} \]

If row-orientation is intended the transpose is used:

\[ x^T = \begin{bmatrix} \alpha & \beta & \gamma \end{bmatrix} = \begin{bmatrix} (1.0 + i4.0) & (2.0 - i5.0) & (3.0 + i6.0) \end{bmatrix} \]

Matrices may be formed as ordered combinations of elements, vectors, or blocks.
Suppose that $\mu = 3.0$ and $\nu = 7.0$. Then, with x as given above, the following
matrices are equivalent:

\[ A = \begin{bmatrix} x & \mu x & \nu x \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 & 3.0 + i12.0 & 7.0 + i28.0 \\ 2.0 - i5.0 & 6.0 - i15.0 & 14.0 - i35.0 \\ 3.0 + i6.0 & 9.0 + i18.0 & 21.0 + i42.0 \end{bmatrix} \tag{A.7} \]

An element within a matrix is usually denoted A(i,j), where i is the row index and
j is the column index. For instance, A(1,3) = 7.0 + i28.0 in (A.7).
A block of the matrix A is a rectangular matrix B within A. MATLAB
notation is useful. For instance, B = A(i:j, k:l) means that B is the block of A's
rows i through j and columns k through l. The row or column ':' means all rows or
all columns. For instance:

\[ B = A(:, 1:2) = \begin{bmatrix} 1.0 + i4.0 & 3.0 + i12.0 \\ 2.0 - i5.0 & 6.0 - i15.0 \\ 3.0 + i6.0 & 9.0 + i18.0 \end{bmatrix} \tag{A.8} \]


As  a  sidenote,  a  number  with  a  decimal  point  should  usually  be  taken  as 
a  real  number.  Mathematically  speaking,  1  =  1.0.  But  many  compilers  treat  1 
as  an  integer  and  use  the  decimal  point  to  recognize  1.0  as  a  floating-point  value. 
Therefore,  all  of  the  code  associated  with  this  work  and  most  of  the  examples  use 
the  decimal  point  as  a  clue  that  the  number  is  a  real  number  or  its  floating-point 
approximation. 

2.    Conjugation and Transposition

The conjugate of a vector or matrix is simply a vector or matrix whose
entries are the conjugates of the original entries. A superscript C is used to denote
the conjugate of a vector or matrix. For instance, with A as given in (A.7),

\[ A^C = \begin{bmatrix} 1.0 - i4.0 & 3.0 - i12.0 & 7.0 - i28.0 \\ 2.0 + i5.0 & 6.0 + i15.0 & 14.0 + i35.0 \\ 3.0 - i6.0 & 9.0 - i18.0 & 21.0 - i42.0 \end{bmatrix} \tag{A.9} \]

The transpose of a vector or matrix, denoted with a superscript T, refers to
a transposition of its rows and columns. With $A \in \mathbb{C}^{m \times n}$, the effect of transposition
is that $A(i,j) = A^T(j,i)$ for all i such that $1 \le i \le m$, and all j so that $1 \le j \le n$.
For example, consider the transposition of the matrix A that is found in equation A.7.

\[ A^T = \begin{bmatrix} x^T \\ \mu x^T \\ \nu x^T \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 & 2.0 - i5.0 & 3.0 + i6.0 \\ 3.0 + i12.0 & 6.0 - i15.0 & 9.0 + i18.0 \\ 7.0 + i28.0 & 14.0 - i35.0 & 21.0 + i42.0 \end{bmatrix} \tag{A.10} \]
In this example we see that the columns of a matrix become the rows of its transpose.
This example also demonstrates that when we first transpose, and then stack the
columns of a matrix, we arrive at the transpose of the matrix. In the event that
$A = A^T$, we say that A is symmetric.

The conjugate (or Hermitian) transpose of A is $A^H$. This matrix is the
result of combining the conjugation and transposition operations on A. The following
example shows the Hermitian transpose of A:

\[ A^H = \begin{bmatrix} 1.0 - i4.0 & 2.0 + i5.0 & 3.0 - i6.0 \\ 3.0 - i12.0 & 6.0 + i15.0 & 9.0 - i18.0 \\ 7.0 - i28.0 & 14.0 + i35.0 & 21.0 - i42.0 \end{bmatrix} \tag{A.11} \]

If $A = A^H$, we say that "A is Hermitian." We should never confuse "A is Hermitian"
with "A Hermitian" (the conjugate transpose, $A^H$, of A). [Ref. 33: p. 294]
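
The three operations can be checked on the example matrix with a few lines of Python. This sketch stores a matrix as a list of rows and uses Python's built-in complex type; the helper names are illustrative and are not the routines of the thesis.

```python
x = [1.0 + 4.0j, 2.0 - 5.0j, 3.0 + 6.0j]
mu, nu = 3.0, 7.0

# A = [x, mu*x, nu*x] as in (A.7), stored as a list of rows.
A = [[xi, mu * xi, nu * xi] for xi in x]


def conjugate(M):
    """Entry-by-entry complex conjugate of M."""
    return [[entry.conjugate() for entry in row] for row in M]


def transpose(M):
    """Rows of M become columns of the result."""
    return [list(col) for col in zip(*M)]


def hermitian(M):
    """Conjugate transpose of M."""
    return transpose(conjugate(M))


def is_hermitian(M):
    """True when M equals its own conjugate transpose."""
    return M == hermitian(M)


print(transpose(A)[2])    # third row of the transpose: nu * x
print(hermitian(A)[0])    # first row of the Hermitian transpose
print(is_hermitian(A))    # False for this A
```

Note that Python indexes from zero, so A(1,3) of the text is `A[0][2]` here.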

3.  Zeros 

It  could  be  argued  that  zero  is  the  most  important  number.  In  addition  to 
its  use  as  a  number,  zero  is  also  used  to  represent  a  vector  or  matrix  in  which  every 
element  is  equal  to  zero.  In  the  (extremely  rare)  event  that  the  context  does  not 
clearly  indicate  the  size  of  a  "0-vector"  or  "0-matrix",  its  size  will  be  given  explicitly. 
In  the  absence  of  implied  or  specified  size,  0  should  be  interpreted  as  the  number 
zero.  Additionally,  blank  space  within  a  matrix  usually  means  that  all  elements  in 
that  region  are  zero. 

4.  Special  Forms 

a.    Axis  Vectors 

An axis vector, $e_i$, is simply the $i$th column (or row) of the identity
matrix.

b.    Lower Triangular

A lower triangular matrix, usually denoted L, has the form

\[ L = \begin{bmatrix} \times & & \\ \times & \times & \\ \times & \times & \times \end{bmatrix} \tag{A.12} \]

If L has ones on the diagonal, it is called unit lower triangular. Similarly, the upper
triangular matrix U has the form

\[ U = \begin{bmatrix} \times & \times & \times \\ & \times & \times \\ & & \times \end{bmatrix} \tag{A.13} \]


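U is called unit upper triangular if the diagonal elements are all ones. Sometimes
(e.g., Chapter III) such a matrix is called right triangular and denoted R. When the
matrix is not square, the lower and upper triangular ideas are translated to lower and
upper trapezoidal, with the unit trapezoidal matrices having ones on the diagonal.
The following matrices illustrate the different kinds of trapezoidal matrices. The
matrices may be tall and skinny as

\[ U = \begin{bmatrix} \times & \times & \times \\ & \times & \times \\ & & \times \\ & & \\ & & \end{bmatrix}, \qquad L = \begin{bmatrix} \times & & \\ \times & \times & \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix} \tag{A.14} \]

or short and fat

\[ U = \begin{bmatrix} \times & \times & \times & \times & \times \\ & \times & \times & \times & \times \\ & & \times & \times & \times \end{bmatrix}, \qquad L = \begin{bmatrix} \times & & & & \\ \times & \times & & & \\ \times & \times & \times & & \end{bmatrix} \tag{A.15} \]


D.   NORMS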


The information below was taken from [Ref. 21: pp. 53-60], so it seems fitting
to begin with a few of Golub and Van Loan's comments on norms.

Norms serve the same purpose on vector spaces that absolute value does on
the real line: they furnish a measure of distance. More precisely, $\Re^n$ together with
a norm on $\Re^n$ defines a metric space. Therefore, we have the familiar notions
of neighborhood, open sets, convergence, and continuity when working with vectors
and vector-valued functions.

1.    Vector Norms

a.  Definition

A vector norm on $\Re^n$ is a function $f : \Re^n \to \Re$ that satisfies the following
properties [Ref. 21: p. 53]:

\[ f(x) \ge 0, \quad x \in \Re^n, \quad (f(x) = 0 \text{ iff } x = 0) \tag{A.16} \]

\[ f(x + y) \le f(x) + f(y), \quad x, y \in \Re^n \tag{A.17} \]

\[ f(\alpha x) = |\alpha| f(x), \quad \alpha \in \Re, \; x \in \Re^n \tag{A.18} \]

We denote such a function with a double bar notation: $f(x) = \| x \|$.

b.  The p-Norm

Subscripts on the double bar are used to distinguish between various
norms. The most popular example of this is the p-norm, $\| \cdot \|_p$. This norm is
defined by [Ref. 21: p. 53]

\[ \| x \|_p = \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}, \quad p \ge 1 \tag{A.19} \]

The 2-norm is the one used most frequently in this work, but the 1- and $\infty$-norms
find frequent application in other work. A natural representation of the 2-norm is
the square root of an inner product

\[ \| x \|_2 = \left( |x_1|^2 + \cdots + |x_n|^2 \right)^{1/2} = \sqrt{x^T x} \tag{A.20} \]

The 2-norm of x is the Euclidean length of the vector x.
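
A direct transcription of the p-norm definition into Python might look like the sketch below. It is an illustration rather than thesis code; the function name `p_norm` is hypothetical, and the infinity-norm is handled with the standard closed form (the largest absolute entry) rather than by taking a limit.

```python
import math


def p_norm(x, p):
    """Vector p-norm for p >= 1; p = math.inf gives the largest |x_i|."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if math.isinf(p):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


x = [3.0, -4.0]
print(p_norm(x, 1))          # sum of absolute values
print(p_norm(x, 2))          # Euclidean length
print(p_norm(x, math.inf))   # largest absolute entry
```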


2.    Matrix Norms

a.  Definition

A matrix norm on $\Re^{m \times n}$ is a function $f : \Re^{m \times n} \to \Re$ that satisfies
properties similar to those presented in the vector case [Ref. 21: p. 56]:

\[ f(A) \ge 0, \quad A \in \Re^{m \times n}, \quad (f(A) = 0 \text{ iff } A = 0) \tag{A.21} \]

\[ f(A + B) \le f(A) + f(B), \quad A, B \in \Re^{m \times n} \tag{A.22} \]

\[ f(\alpha A) = |\alpha| f(A), \quad \alpha \in \Re, \; A \in \Re^{m \times n} \tag{A.23} \]

Matrix norms also use the double bar notation: $f(A) = \| A \|$. The Frobenius norm
and the p-norm are the most common matrix norms.

b.  Frobenius

The Frobenius norm is defined as

\[ \| A \|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 } \tag{A.24} \]

c.    p-Norms

The p-norm of a matrix, A, is defined by

\[ \| A \|_p = \sup_{x \ne 0} \frac{\| Ax \|_p}{\| x \|_p} \tag{A.25} \]
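
The Frobenius norm can be computed directly from its definition. The matrix p-norm involves a supremum, but for p = 1 and p = infinity it reduces to well-known closed forms (the largest absolute column sum and the largest absolute row sum, respectively). The Python sketch below illustrates all three on a small real matrix; the function names are hypothetical and the code is not taken from the thesis.

```python
import math


def frobenius_norm(A):
    """Square root of the sum of squared magnitudes of all entries."""
    return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))


def one_norm(A):
    """Matrix 1-norm: largest absolute column sum."""
    return max(sum(abs(a) for a in col) for col in zip(*A))


def inf_norm(A):
    """Matrix infinity-norm: largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)


A = [[2.0, 3.0, -4.0],
     [3.0, -5.0, 7.0],
     [4.0, 6.0, -2.0]]

print(frobenius_norm(A), one_norm(A), inf_norm(A))
```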


E.    LINEAR SYSTEMS


One of the fundamental tasks of linear algebra is to form a matrix representation
of a system of linear equations. Consider the system of linear equations:

\[ \begin{aligned} 2u_1 + 3u_2 - 4u_3 &= 7 \\ 3u_1 - 5u_2 + 7u_3 &= 3 \\ 4u_1 + 6u_2 - 2u_3 &= 1 \end{aligned} \tag{A.26} \]

This system of equations can be expressed using the matrix notation Au = b

\[ Au = \begin{bmatrix} 2 & 3 & -4 \\ 3 & -5 & 7 \\ 4 & 6 & -2 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} 7 \\ 3 \\ 1 \end{bmatrix} = b \tag{A.27} \]
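
For a concrete check, the system (A.27) can be solved by Gaussian elimination with partial pivoting. The Python sketch below is a serial illustration using exact rational arithmetic; it is not the parallel GF(PP) code of the thesis, and the function name `solve_partial_pivoting` is hypothetical.

```python
from fractions import Fraction


def solve_partial_pivoting(A, b):
    """Solve Au = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b] in exact rational arithmetic.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)]
         for row, rhs in zip(A, b)]
    for k in range(n):
        # Partial pivoting: pick the largest magnitude in column k.
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[pivot][k] == 0:
            raise ValueError("matrix is singular")
        M[k], M[pivot] = M[pivot], M[k]
        # Eliminate column k from the rows below the pivot row.
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    # Back substitution on the upper triangular system.
    u = [Fraction(0)] * n
    for k in range(n - 1, -1, -1):
        s = sum(M[k][c] * u[c] for c in range(k + 1, n))
        u[k] = (M[k][n] - s) / M[k][k]
    return u


A = [[2, 3, -4], [3, -5, 7], [4, 6, -2]]
b = [7, 3, 1]
u = solve_partial_pivoting(A, b)
print(u)
```

Substituting the result back into (A.26) reproduces the right-hand side exactly, since no rounding occurs with rational arithmetic.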


F.   MEASURES  OF  COMPLEXITY 

The  first,  and  most  rudimentary  requirement  for  an  algorithm  is  that  it  produce 
the  correct  answer.  This  seems  utterly  obvious,  but  it  must  never  be  lost  in  the 
algorithm  designer's  pursuit  of  the  next  most  important  elements — efficiency  in  using 
time  and  space.  For  the  moment,  we  shall  assume  that  the  algorithm  arrives  at  an 
acceptable  answer.  Then  the  algorithm's  use  of  time  and  space  becomes  a  very 
serious  subject.  Knuth  provides  the  notation  in  [Ref.  34]. 

The  time  complexity  of  an  algorithm,  also  known  as  running  time,  describes  how 
the  program  works  under  a  stopwatch.  Space  complexity  is  the  amount  of  temporary 
storage  required  to  carry  out  the  algorithm.  For  example,  suppose  a  person  stood  at 
a  chalkboard,  ready  to  solve  a  problem.  We  would  not  regard  the  input  or  output 
storage  space,  but  only  the  required  space  on  the  chalkboard,  in  the  space  complexity 
of  the  problem.  Usually  we  like  to  link  the  idea  of  complexity  to  the  input  size  of  the 
problem,  n.  The  following  discussion  of  time  complexity  outlines  a  few  tools  that 
are  standard  in  the  study  of  algorithms.  The  same  tools  and  ideas  apply  for  space 
complexity  analysis.  [Ref.  35 :    pp.  42-43] 

The most common method for describing the time complexity of an algorithm
is the "big-Oh" notation [Ref. 35: p. 39].2 A function g(n) is O(f(n)) if there exist
constants c and N so that, for all $n \ge N$, $g(n) \le c f(n)$.

\[ g(n) = O(f(n)) \iff g(n) \le c f(n), \quad n \ge N \tag{A.28} \]


2 O(f(n)) is read "order f(n)."

This means that for a large enough problem size n, the time to execute, g(n), is
at most a constant multiple of some function, f(n). Big-Oh notation does not mean
a least upper bound, only an upper bound for n sufficiently large. Practically, O(f(n))
must be augmented so that we may determine how tightly c f(n) bounds g(n).

By adding a lower bound to big-Oh, we may arrive at a more informative
statement concerning an algorithm's complexity. This is achieved through the use of
"big Omega". $T(n) = \Omega(g(n))$ means that there exist constants $c$ and $N$ such that,
for all $n \ge N$, the number of steps $T(n)$ required to solve the problem for input size
n is at least $c\,g(n)$.

$$ T(n) = \Omega(g(n)) \iff T(n) \ge c\,g(n), \quad n \ge N \tag{A.29} $$

This is essentially a lower bound on time complexity. If a function $f(n)$ satisfies
both $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$ (not necessarily using the same constants
$c$ and $N$ for both $O$ and $\Omega$), then we say that $f(n) = \Theta(g(n))$. [Ref. 35: p. 41]

$$ f(n) = O(g(n)) \ \text{and}\ f(n) = \Omega(g(n)) \iff f(n) = \Theta(g(n)) \tag{A.30} $$

Now and then, notation similar to $O$ and $\Omega$ is required except that a strict inequality
is desired. In this case, we use "little oh" and "little omega". The definitions are:

$$ f(n) = o(g(n)) \iff \lim_{n \to \infty} \frac{f(n)}{g(n)} = 0 \iff g(n) = \omega(f(n)) \tag{A.31} $$
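As a worked example of the limit definition, consider the growth rates $n \log_2 n$ and $n^2$. The choice of functions is only illustrative. The second line shows a case where the limit is a nonzero constant, so the strict notation does not apply even though big-Oh does.

```latex
\[
\lim_{n \to \infty} \frac{n \log_2 n}{n^{2}}
  = \lim_{n \to \infty} \frac{\log_2 n}{n} = 0
\quad\Longrightarrow\quad
n \log_2 n = o(n^{2}) \ \text{ and } \ n^{2} = \omega(n \log_2 n).
\]
\[
\lim_{n \to \infty} \frac{3n^{2}}{n^{2}} = 3 \neq 0
\quad\Longrightarrow\quad
3n^{2} = O(n^{2}) \ \text{ but } \ 3n^{2} \neq o(n^{2}).
\]
```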

We have seen that $O$, $\Omega$, $\Theta$, $o$, and $\omega$ are roughly equivalent to the inequalities
$\le$, $\ge$, $=$, $<$, and $>$, respectively. Is this notation meaningful? Does it have utility in
problem  solving?  The  answer  is  a  guarded  "yes."  We  must  understand  the  purpose 
of  the  notation.  It  cannot  substitute  for  timing  data  taken  from  the  actual  execution 
of  an  algorithm.  It  is  intended  as  a  good  first  estimate.  There  are  too  many  variables 
involved  in  modern  tools  and  machinery  to  expect  accurate  analysis  from  other  than 
actual  execution. 
TABLE A.1: ALGORITHM COMPLEXITY AND MACHINE SPEED

Algorithm      Execution Time (in Seconds) for Machine Speed
Complexity     1000 steps/sec   2000 steps/sec   4000 steps/sec   8000 steps/sec

log2 n         0.01             0.005            0.003            0.001
n              1                0.5              0.25             0.125
n log2 n       10               5                2.5              1.25
n^1.5          32               16               8                4
n^2            1,000            500              250              125
n^3            1,000,000        500,000          250,000          125,000
1.1^n          10^39            10^39            10^38            10^38

Nevertheless, a rough estimate of how a problem grows is important to the
problem-solving process. Indeed, experimental results and complexity analysis should not
usually be considered independently, but compared and used as complementary
instruments. The time complexity of an algorithm is, in a sense, more important than
the speed of the machine upon which it is executed. Consider the data in Table A.1
(adapted from [Ref. 35: p. 41]). This is based upon a problem of size n = 1000 and
demonstrates the ability of an algorithm to dominate a machine. For this reason,
and with these conditions clearly established, we will find many occasions to use
time- and space-complexity notation.
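The arithmetic behind Table A.1 can be sketched in a few lines: the execution time is the step count for n = 1000 divided by the machine speed. The names `COMPLEXITIES`, `SPEEDS`, and `execution_time` are introduced here for illustration only, and the table's entries are rounded, so the output agrees with it only approximately.

```python
import math

N_SIZE = 1000                      # problem size used in Table A.1
SPEEDS = [1000, 2000, 4000, 8000]  # machine speeds in steps per second

# Step counts as functions of n, one entry per row of the table.
COMPLEXITIES = {
    "log2 n":   lambda n: math.log2(n),
    "n":        lambda n: n,
    "n log2 n": lambda n: n * math.log2(n),
    "n^1.5":    lambda n: n ** 1.5,
    "n^2":      lambda n: n ** 2,
    "n^3":      lambda n: n ** 3,
    "1.1^n":    lambda n: 1.1 ** n,
}

def execution_time(steps, speed):
    """Seconds needed to run `steps` steps at `speed` steps per second."""
    return steps / speed

for name, steps in COMPLEXITIES.items():
    row = [execution_time(steps(N_SIZE), s) for s in SPEEDS]
    print(f"{name:10s}" + "".join(f"{t:14.3g}" for t in row))
```

Doubling the machine speed halves every entry in a row, while moving down one row can change the time by orders of magnitude, which is the point the table makes.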

Finally, the two most common performance measures for parallel computing
are speedup and efficiency. Suppose that $T_n$ is the time of execution for a particular
algorithm, $A$, on n processors. Consider the best uniprocessor time $T_1$ for a sequential
version of $A$ compared to the execution of an equivalent (not necessarily the same)
parallel program on $P$ processors that executes in time $T_P$. Then speedup, $S_P$, is
defined as

$$ S_P = \frac{T_1}{T_P} $$

and the efficiency, $E_P$, is defined to be

$$ E_P = \frac{S_P}{P} $$
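A minimal sketch of the two measures follows, using the usual definitions S_P = T_1 / T_P and E_P = S_P / P. The function names and the timings are hypothetical and are not measurements from this research.

```python
def speedup(t1, tp):
    """S_P = T_1 / T_P: best uniprocessor time over parallel time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E_P = S_P / P: speedup obtained per processor."""
    return speedup(t1, tp) / p

# Hypothetical timings in seconds, invented for illustration only.
t1, t8 = 120.0, 20.0
print(speedup(t1, t8))        # 6.0
print(efficiency(t1, t8, 8))  # 0.75
```

An efficiency of 1.0 would mean that all P processors were doing useful work for the whole run; communication and idle time push the value below that.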
APPENDIX B
EQUIPMENT 

A  transputer  is  a  microcomputer  with  its  own  local  memory  and  with  links 
for  connecting  one  transputer  to  another  transputer. 

The transputer architecture defines a family of programmable VLSI components.
The definition of the architecture falls naturally into the logical aspects
which define how a system of interconnected transputers is designed and programmed,
and the physical aspects which define how transputers, as VLSI components,
are interconnected and controlled.

A typical member of the transputer product family is a single chip containing
processor, memory, and communication links which provide point to point connection
between transputers. In addition, each transputer product contains special
circuitry and interfaces adapting it to a particular use. For example, a peripheral
control transputer, such as a graphics or disk controller, has interfaces tailored to
the requirements of a specific device.

A  transputer  can  be  used  in  a  single  processor  system  or  in  networks  to  build 
high  performance  concurrent  systems.  A  network  of  transputers  and  peripheral 
controllers  is  easily  constructed  using  point-to-point  communication. 

—  INMOS 

This  introduction  is  provided  by  the  transputer's  maker  in  [Ref.  36:    p.  7]. 

A.  TRANSPUTER  MODULES 

INMOS  makes  a  wide  variety  of  microprocessors  to  suit  differing  needs.  To 
provide  a  simple,  modular  interface  they  have  developed  the  notion  of  a  transputer 
module  (TRAM).  The  TRAM  is  a  small  board  containing  the  microprocessor,  RAM, 
other  circuitry,  and  a  standard  sixteen  signal  interface. 

B.  THE  IMS  B012 

Most of the later experiments were carried out on an IMS B012 board. This
board accommodates sixteen transputers, each of which is installed on its own IMS
B401 TRAM. In our case the TRAM holds 32 kilobytes of memory (in addition to
the four kilobytes on board the T800-20 transputer).

d.    INMOS  Transputers 

The  INMOS  transputer  gives  the  system  designer  a  tremendous  amount 
of  latitude.  With  these  processors — perhaps  more  than  with  any  other  parallel 
architecture — one  should  give  careful  thought  to  the  size,  component  processors,  and 
interconnection  topology  as  the  first  elements  in  designing  a  solution  to  a  problem. 
This cannot be overemphasized. When the hardware is not "general purpose" in nature,
it must receive thoughtful consideration along the path to solving the problem.
Some  of  the  largest  applications  for  parallel  machines — especially  for  transputers — 
are  embedded  systems. 

An  embedded  computer  system  is  defined  as  "one  that  forms  a  part  of 
a  larger  system  whose  purpose  is  not  primarily  computational."  [Ref.  37:  pp.  15-16] 
To  automatically  accept  or  assume  a  particular  machine  configuration  is  to  relinquish 
control  of  one  of  the  tools  available  in  system  design. 

Transputer is the name given to the members of a family of microprocessors.
While INMOS is the largest producer of these processors, they have not chosen
to protect the name transputer with any sort of trademark. The name comes from
a combination of "transistor computer" and each transputer is essentially a computer
on a chip. The chip possesses an arithmetic logic unit (ALU), memory, and a
communication system that supports bidirectional serial communication links. Most
of the transputers used for this research also include a 64-bit (IEEE 754 standard)
floating-point unit (FPU).

The transputer module (TRAM) is the most common package for transputers.
The capabilities of these modules are quite diverse, but they hold to a
standard interface design. This makes the TRAM easy to use. Systems designed
around TRAMs enjoy simple replacement of components, ease of modification, and
great scalability. Indeed, the laboratory environment in which these TRAMs were
exercised is a very dynamic one.

The  PARCDS  laboratory  has  six  80286-based  IBM-compatible  personal 
computers,  each  of  which  contains  a  transputer  interface  board.  Five  hold  IMS  B004 
boards  and  one  holds  a  Transtech  TMB08  board.  The  B004  boards  each  have  two 
megabytes  of  memory  and  an  IMS  T414  transputer  in  addition  to  the  requisite 
serial-to-parallel  converter  and  interface  circuits.  The  TMB08  holds  four  megabytes 
of  memory  and  an  IMS  T800-20  transputer.  These  "host"  machines  can  each  be 
connected  to  an  arbitrarily  large  network  of  transputers. 

For  this  purpose,  we  have  two  INMOS  Transputer  Evaluation  Module 
(ITEM)  boxes.  These  boxes  can  hold  at  least  ten  boards  of  the  Double  Eurocard  size 
(approximately 22 cm x 23.5 cm). Of primary interest for this thesis was the IMS
B012 board, a motherboard capable of supporting sixteen TRAMs. For this research,
all  sixteen  slots  were  filled  with  a  TRAM  that  held  an  IMS  T800-20  transputer  and 
32  kilobytes  of  TRAM  memory  (in  addition  to  the  transputer's  four  kilobytes).  The 
shortage  of  memory  is  probably  the  greatest  deficiency  and  indicator  of  the  outdated 
nature  of  these  processors.  TRAMs  with  four  and  eight  megabytes  of  memory  and 
IMS  T805-25  transputers  are  currently  available  for  less  than  $900.00  and  $1,300.00 
respectively. 

e.    Intel  iPSC/2 

The  iPSC/2  used  for  this  research  contained  eight  node  processors  of 
the  "CX"  type  (80386/80387  combination).  Like  the  transputers,  this  machine  is 
somewhat dated. Today's i860 chips have far greater capacity.
C.   SWITCHING METHODS

The iPSC/2 and transputer hardware use different switching methods. Intel
uses  a  circuit  switching  approach,  whereas  the  INMOS  approach  is  store-and-forward 
switching.  Each  approach  has  advantages  and  disadvantages.  The  circuit  switching 
approach  is  "almost  universally  used  for  telephone  networks."  [Ref.  38:  p.  12]  The 
idea  is  to  first  define  a  path  (close  a  circuit)  from  the  source  to  the  destination  and 
then  use  it  as  a  dedicated  line. 

This  requires  a  start-up  overhead  that  depends  entirely  upon  the  current  load 
being  handled  by  the  system.  If  any  part  of  the  medium  (links  or  switches)  between 
the  source  and  destination  is  busy,  the  message  will  wait  at  the  source  until  the 
entire  path  is  clear.  The  path  is  determined  (in  the  iPSC/2  case)  in  a  deterministic 
fashion,  so  that  a  message  from  node  i  to  node  j  will  always  insist  on  a  particular 
path,  even  if  some  other  communication  is  blocking  that  path.  As  the  path  becomes 
clear,  switches  between  the  source  and  destination  are  set  so  that  a  dedicated  line 
will  exist  from  source  to  destination. 

After the overhead of establishing (closing) the circuit has been paid, communication
proceeds at a rapid rate. The intermediate nodes along the path do not
store  the  message.  Instead,  their  switches  have  been  set  so  that  the  message  flows 
through.  Intuitively,  this  approach  should  be  quite  effective  in  a  network  with  a  very 
structured  interconnection  topology  and  a  relatively  small  number  of  nodes.  The 
hypercube  gives  us  this  structure.  Hypercubes  of  order  three  or  four  are  probably 
small  enough  to  avoid  difficulties  that  might  arise  as  many  nodes  contend  for  the 
same  medium. 

The store-and-forward approach does not require the availability of the entire
path between source and destination nodes. Instead, each node along the path accepts
the entire message in turn and then forwards it to the next node in the path.
This requires the use of no more than one link at a time. For a many-node environment
(particularly if there is little structure or the potential of dynamic routing), this
approach would seem to offer some advantages over the circuit switching approach.

The routing criterion is separate from the type of switching used. Either of
the two general approaches described above can support many forms of routing.
Deterministic approaches alone include many methods. For the hypercube topology
with Gray-coded node labels, it is probably useful to combine the Gray code with
the notion of Hamming distance to arrive at a shortest path route. Even with this
approach, there are as many node-disjoint optimum paths between two nodes i and j
as the Hamming distance, H(i,j), between them. [Ref. 39: p. 7] If a dynamic scheme
is used to determine the path, there are even more combinations of potential paths
from i to j. Usually a dynamic approach considers media utilization, "hot spot"
avoidance, and so on.
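One deterministic scheme of the kind described above can be sketched as follows: correct the differing label bits one at a time, from least to most significant. This is often called dimension-order routing. It is offered only as an illustration under the assumption that neighboring nodes have labels differing in one bit; no claim is made that either machine discussed here routes this way, and the function name is hypothetical.

```python
def dimension_order_route(src, dst):
    """Return the node labels visited from src to dst in a hypercube,
    correcting differing bits from least to most significant."""
    path = [src]
    node = src
    diff = src ^ dst          # bits set where the two labels differ
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit  # cross the link in dimension `bit`
            path.append(node)
        diff >>= 1
        bit += 1
    return path

route = dimension_order_route(0b000, 0b101)
print([format(v, "03b") for v in route])  # ['000', '001', '101']
```

The route uses exactly H(i,j) hops, so it is a shortest path. Correcting the bits in a different order yields the other shortest paths between the same pair of nodes.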
APPENDIX C
INTERCONNECTION TOPOLOGIES

Multiprocessor computing brings with it a fundamental concern: interprocessor
communication. Communication is, to any designer of computing machinery
or software, a burden and hindrance. An interconnection topology describes the
network that handles this load. The hypercube is one of the many topologies used
in multiprocessor computing. It has been the subject of both hype and criticism.
Nevertheless, this particular scheme possesses the qualities that quickly draw the
attention of mathematicians and parallel programmers. The hypercube's structure
and simplicity make it dependable and predictable. The same properties that
enable the hypercube to endure the rigor of mathematical proof lead to practical
solutions in parallel programming. This discussion describes the hypercube
topology and explores some of the qualities that make it a practical choice for
multiprocessor computing.

A.  A  FAMILIAR  SETTING 

Organizing processors into a suitable topology is analogous to the familiar problem
of organizing personnel into groups. An independent worker has limited capacity,
so we often set more hands (or machinery) to the task for productivity's sake. Groups
of people are often less efficient. Efficiency is a ratio of time spent doing useful work
to the total time spent. Other metrics might work, but time is universally recognized
as the standard against which productivity is measured. Dependence upon
others requires communication and consumes time. The loss may be minimized,
but not avoided. Any group working toward a common goal must deal with
this problem. To be efficient, an organization must possess structure and media for
communication.

People  spend  time  on  meetings,  paperwork,  and  peripheral  pursuits — all  for 
the  sake  of  an  organization  that  hopes  to  outperform  the  individual.  Organizations 
typically  perform  tasks  that  are  simply  impossible  for  an  individual.  To  be  sure,  an 
individual often possesses the independence and efficiency that make him the proper
choice.  There  are  tasks  that  seem  to  fit  one  or  the  other  and — while  there  is  some 
crossover  in  ability — we  aren't  likely  to  get  rid  of  either  organizations  or  individual 
workers  soon!  This  is  worth  considerable  attention.  Individuals  and  organizations 
are  chosen  for  different  tasks. 

These  ideas  apply  in  the  world  of  parallel  processing.  First,  there  are  many 
tasks.  Some  fit  nicely  onto  a  single  processor.  Others  beg  a  parallel  solution.  Finally, 
some  have  natural  solutions  by  either  method.  Even  when  one  of  these  options  is 
selected,  there  are  many  ways  to  solve  the  problem.  If  a  multiprocessor  is  used  to 
solve  the  problem,  the  issue  of  communications  will  be  unavoidable. 

An interconnection topology must carry the burden of interprocessor communications.
There are many schemes for handling this mission. This discussion focuses
on  one  design  that  fulfills  that  mission:  the  hypercube.  To  forestall  confusion:  the 
subject  is  an  interconnection  topology,  not  a  particular  vendor's  product. 

B.   APPEAL  TO  INTUITION 

Productivity  can  suffer  when  the  members  of  an  organization  communicate 
excessively.  A  lack  of  communication  can  also  reduce  efficiency.  In  a  network  of 
processors,  lines  of  communication  (links)  are  literal.  The  system  will  not  be  flexible 
if  there  is  a  shortage  of  links,  but  with  too  many  links  a  message  could  get  delayed 
or  lost  in  the  confusion.  The  hypercube  attempts  to  strike  a  balance. 

Hypercubes  come  in  different  sizes.  In  fact,  scalability  is  a  key  characteristic  of 
the  hypercube.  It  allows  the  designer  to  tailor  a  network  to  a  problem.  There  are 
several  ways  to  express  the  cube's  size:  order  is  one  measure.  The  term  "hypercube 
of order n" (usually called an n-cube) is filled with meaning. A more detailed description
is given later, but pictures provide the most direct introduction. Figure C.1
shows hypercubes of order n where $n \in \{0, 1, 2, 3\}$.

Figure C.1: The Four Smallest Hypercubes

This  illustration  is  important.  The  hypercube  shows  geometry,  structure,  and 
symmetry.  A  few  observations  nearly  jump  out  of  the  pictures.  One  can  see  several 
terms  of  a  geometric  series  developing.  There  is  also  a  recurrence  relation  at  work 
in  the  building  of  hypercubes.  Intuition  suggests  the  use  of  well-oiled  mathematical 
tools  to  analyze  the  hypercube. 

C.   TOOLS 

Many benefits may be derived from a few definitions, conventions, and tools
(that suit the hypercube's structure). Figure C.2 demonstrates the utility of Cartesian
coordinates in n-dimensional space.

Figure C.2: Cartesian Coordinates for a 3-Cube

The picture is deceptively simple, but worth careful study. Figure C.2 shows a
unit cube in three dimensions. The vertex labels express (xyz) position in the coordinate
system. The labels also form a binary (Gray) code that corresponds to the
coordinate labeling of a cube in n-dimensional space. The issue of communications
prompted this discussion, so distance must be addressed. A comparison of the
binary labels of any two nodes reveals that the distance between the nodes is equal to
the number of bits that differ in the labels. This measure, called Hamming distance,
and the Gray code are presented in more detail later.

This brief introduction is just enough to embark upon a more precise description
of the hypercube. The ideas of a coordinate system, node labeling, and distance
are fundamental. Graph theory also finds application in topology design. In the
hypercube these four tools complement each other nicely. Despite their simplicity they
can be explored in almost endless detail, even within the constraints of hypercube
structure.

D.   DESCRIBING THE HYPERCUBE

The  hypercube  interconnection  topology  cannot  be  captured  in  a  one-sentence 
definition.  A  definition  is  often  inappropriate  for  material  objects.  A  description 
given  from  several  perspectives  may  be  more  useful.  This  is  the  case  with  topologies. 
Each  tool  introduced  above  has  its  own  utility.  In  a  sense,  each  takes  up  a  particular 
perspective.  A  meaningful  characterization  of  the  hypercube  can  be  achieved  by 
combining  these  perspectives. 

The geometric view is most useful for visualizing the cubes. Despite its tendency
to break down (with three-dimensional limitations), geometry's intuitive appeal
is indispensable. Geometry and pictures lay the foundation for the setting of
an undirected graph. Figures C.1 and C.2 take advantage of geometry, but
three-dimensional sketches begin to lose their appeal as order increases. Nevertheless,
geometry  and  visual  models  hold  an  important  place  in  describing  the  hypercube. 
They  furnish  us  with  (a)  examples  for  comparison,  and  (b)  expectations  that  are 
useful  in  the  transition  to  a  more  general  description  of  the  topology. 

A hypercube of order n may be described as a set of $2^n$ points (vertices, nodes,
or processors) connected by a set of edges. The points are each given an n-bit
binary label, $b_n \ldots b_3 b_2 b_1$. Thus the hypercube's node labels exhaust all possible
n-bit binary combinations. Furthermore, the labeling convention used in Figure C.2
describes the point's n-dimensional Cartesian coordinates.

The hypercube edge set (communication links) includes an edge between every
pair of points $p_i$ and $p_j$ whose binary labels differ in exactly one bit position, say $b_k$.
That is, adjacent nodes have a Hamming distance of one. This measure of distance
proves especially convenient in the hypercube, and it can be thought of in several
equivalent ways. A first definition of Hamming distance is the number of bits that
differ in the two labels. Equivalently, it is the number of 1's in a bitwise exclusive
or (XOR) of the numbers. Figure C.2 contains an example. Let $p_i$ be the point
labeled 100 and $p_j$ be 110. The binary labels differ in exactly one bit position,
namely $b_2$ (the second bit). The points are neighbors (one hop from each other in
communications terms). [Ref. 40]
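The XOR formulation of Hamming distance translates directly into code. The sketch below uses integer node labels and reproduces the example of the points labeled 100 and 110; the function names are illustrative only.

```python
def hamming_distance(a, b):
    """Number of bit positions in which labels a and b differ,
    counted as the number of 1's in their bitwise XOR."""
    return bin(a ^ b).count("1")

def are_neighbors(a, b):
    """Two hypercube nodes are adjacent when their labels differ in one bit."""
    return hamming_distance(a, b) == 1

print(hamming_distance(0b100, 0b110))  # 1
print(are_neighbors(0b100, 0b110))     # True
print(hamming_distance(0b000, 0b111))  # 3
```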

Despite the appeal of the geometric approach, it holds limited value in a general
n-dimensional space. Consider n = 4 in three dimensions. Typical illustrations
show the sixteen-node cube as a cube inside a cube with connections between corresponding
nodes of the inner and outer cubes. An equivalent diagram would display
two 3-cubes side-by-side with connections to corresponding nodes. Nevertheless, it
seems  that  an  n-dimensional  coordinate  system  is  the  most  convenient  environment 
for  sketching  the  hypercube  of  order  n. 

E.   GREATER  DIMENSIONS 

Three-dimensional  sketches  become  difficult  to  manage.  The  time  comes  for  a 
change  of  method.  Some  of  the  finest  tools  available  for  spanning  such  a  gap  are 
recurrence  relations  and  the  principle  of  mathematical  induction.  The  approach  is 
not  extremely  formal,  but  those  so  inclined  will  not  find  it  hard  to  add  the  formalities. 

Induction  can  be  used  to  generate  a  Gray  code  suitable  for  labeling  the  nodes 
of  a  hypercube.  This  code  and  the  Hamming  distance  can  be  used  to  determine 
the  cube.  The  first  topic  is  a  procedural  description  of  how  to  build  hypercubes.  A 
Gray  code  construction  procedure  will  follow.  If  the  two  topics  appear  similar,  it  is 
because  they  are  completely  equivalent  (assuming  that  the  Gray  code  is  combined 
with  the  concept  of  Hamming  distance). 

Constructing a hypercube of order zero is trivial. This is not important except
that it leads to greater things (i.e., it is the basis for induction). Next, suppose
that this hypothesis for induction is true: "we know how to construct any hypercube
of order k where $0 \le k < n$". Induction forms a hypercube of order n using this
base case and hypothesis. This can be done in three steps:

•  Replicate the hypercube of order $(n-1)$ so that there are two identical copies.
For concreteness, one will be copy number 0 and the other will be copy number
1. The hypercubes have $2^{n-1}$ nodes each.

•  Prepend the copy number to the existing node labels. That is, place a leading 0
in front of the labels for each node of copy 0 and place a 1 in front of every node
label in copy 1. Now every node in one copy has a corresponding node in the
other copy. These corresponding nodes are separated by a Hamming distance
of one. That is, the last $(n-1)$ bits are the same for corresponding nodes and
they differ only in the prepended copy number.

•  Connect all nodes whose labels differ only in the prepended copy number. This
adds $2^{n-1}$ edges between the two copies.
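The three steps above can be sketched as a short recursive function. This is a minimal illustration, not a listing from the thesis: node labels are represented as bit strings, the single node of the order-zero cube is given the empty label, and the name `build_hypercube` is hypothetical.

```python
def build_hypercube(n):
    """Return (nodes, edges) for a hypercube of order n.
    Nodes are n-bit label strings; edges are pairs of labels."""
    if n == 0:
        return [""], []  # base case: one node, no edges
    sub_nodes, sub_edges = build_hypercube(n - 1)
    # Steps 1 and 2: make two copies and prepend the copy number.
    nodes = ["0" + v for v in sub_nodes] + ["1" + v for v in sub_nodes]
    edges = [("0" + a, "0" + b) for a, b in sub_edges]
    edges += [("1" + a, "1" + b) for a, b in sub_edges]
    # Step 3: connect nodes that differ only in the prepended bit.
    edges += [("0" + v, "1" + v) for v in sub_nodes]
    return nodes, edges

nodes, edges = build_hypercube(3)
print(len(nodes), len(edges))  # 8 12
```

Every edge produced this way joins two labels that differ in exactly one position, which is the adjacency rule stated earlier in terms of Hamming distance.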

F.   GRAY  CODE  GENERATION 

The  procedure  above  generates  hypercubes.  By  focusing  on  the  vertex  labels, 
Gray  code  generation  can  be  discussed.  A  Gray  code  is  a  cyclic  list  of  all  of  the  n-bit 
numbers which changes in only one bit from one number to the next [Ref. 40]. Since
the code is binary, there are $2^n$ numbers in the list. The starting point is arbitrary
(it  is  cyclic)  but  I  have  started  with  zero.  Perhaps  the  best  explanation  of  Gray 
codes  comes  in  the  construction  of  one.  As  in  the  construction  of  hypercubes,  a  base 
case  is  required  to  begin  generation. 

•  Start with 0. This is a one-bit number (n = 1) so the one-bit Gray code must
have a total of $2^1 = 2$ numbers. The other is 1. Next, the hypercube building
steps established above are applied with slight modification.

•  Given the one-bit case, it is easy to generate the n = 2 code. Write down the
previous code and draw a line below it. Next, form a copy by reflecting the code
downward across the line. Place a zero in front of each number in the previous
code (above the line), and a one in front of each number in the new copy (below
the line).

•  This is a Gray code for n = 2. Table C.1 extends the idea. The list is cyclic,
each number consists of n bits, and the list contains all $2^n$ possible numbers. To
construct the code for larger n, the process may be applied repetitively. Copy
by reflecting the $(n-1)$-bit code downward across a line, prepend a zero to
everything above the (most recent) line, and prepend a one to those below that
line.

TABLE C.1: GRAY CODE GENERATION

n = 2     n = 3     n = 4

00        000       0000
01        001       0001
11        011       0011
10        010       0010
          110       0110
          111       0111
          101       0101
          100       0100
                    1100
                    1101
                    1111
                    1110
                    1010
                    1011
                    1001
                    1000
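The reflect-and-prepend procedure can be written in a few lines. The sketch below is not the thesis's own program listing; it is a minimal illustration that represents each code word as a bit string, and the name `gray_code` is hypothetical.

```python
def gray_code(n):
    """Reflected Gray code on n bits (n >= 1), as a list of bit strings."""
    code = ["0", "1"]              # base case: the one-bit code
    for _ in range(n - 1):
        reflected = code[::-1]     # copy reflected "across the line"
        code = ["0" + w for w in code] + ["1" + w for w in reflected]
    return code

print(gray_code(2))  # ['00', '01', '11', '10']
print(gray_code(3))
```

Because the copy is reflected, the last word above the line and the first word below it are identical before prepending, so they differ only in the new leading bit. The same holds for the first and last words of the whole list, which is why the code is cyclic.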


The Gray code is probably the most useful node labeling to attach to the hypercube.
This code often appears in implementation. The program listing that begins
on page 152 shows one way to generate the code. It can be used, for instance, as the
backbone of a routing function in a network. Labels with a Hamming distance of one
mark neighbors in the hypercube. What about the labels of two nodes that differ
in exactly k bits (i.e., have a Hamming distance of k)? It turns out that k is the
distance (number of edges) between these nodes. For all communications between
these nodes, the shortest path will involve k hops.

This  also  indicates  that,  for  an  n-cube,  there  is  no  pair  of  nodes  that  have 
a  Hamming  distance  of  more  than  n  (e.g.,  communication  between  nodes  0000010 
and  1111101  in  a  7-cube  can  be  achieved  in  seven  hops).  The  greatest  distance 
across the n-cube is n hops. In fact, for each node in a hypercube, there is a unique
corresponding  node  at  a  Hamming  distance  of  n.  Also,  there  are  n  nodes  at  a 
Hamming  distance  of  one  from  each  of  the  hypercube's  nodes. 
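Both observations are easy to check with integer labels: the n neighbors of a node are obtained by flipping one bit at a time, and the unique node at distance n is obtained by flipping every bit. The function names below are illustrative only.

```python
def neighbors(node, n):
    """The n labels at Hamming distance one from node in an n-cube."""
    return [node ^ (1 << k) for k in range(n)]

def antipode(node, n):
    """The unique label at Hamming distance n: every bit complemented."""
    return node ^ ((1 << n) - 1)

n = 7
print(format(antipode(0b0000010, n), "07b"))  # 1111101
print(len(neighbors(0b0000010, n)))           # 7
```

The first line of output reproduces the 7-cube example from the text: node 1111101 is the farthest node from 0000010.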

Two  approaches  have  been  considered  so  far:  sketching  cubes  in  n-dimensional 
Cartesian  coordinates  and  studying  the  labels  associated  with  the  cubes.  Though 
the approaches are fundamentally different, they arrived at many of the same conclusions.
Careful application of the Gray code and Hamming distance could produce a
nearly  endless  string  of  results,  but  it  is  more  convenient  to  introduce  some  material 
from  the  study  of  graphs  at  this  point.  Graph  theory  combines  the  two  approaches: 
it  looks  at  the  pictures  and  studies  the  numbers  as  well.  The  small  hypercubes 
described  with  earlier  methods  are  given  graph  representation  in  the  illustration  of 
Figure  C.3. 

G.   GRAPHS  OF  HYPERCUBES 

Graph  theory  is,  of  course,  much  more  sophisticated  than  the  small  subset 
used  here.  Buckley  and  Harary  provide  a  valuable  source  [Ref.  41].  This  discussion 
exposes  a  few  salient  features  of  the  hypercube  from  the  perspective  of  graphs. 

A  graph,  H ',  consists  of  a  vertex  set,  V(H),  and  an  edge  set,  E(H).  The  vertices, 
or  nodes,  in  the  multiprocessor  network  model  are  the  processors.  The  edges  are  the 
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Figure  C.3i  Hypercube  Graphs 

links  ili.ii  connect  the  processors.  I  will  avoid  using  the  term  order  in  its  graph 
theorj  sense  (i.e.,  number  of  nodes)  so  that  it  cannot  be  confused  with  the  order  of 
the  hypercube  Consider  the  graph,  //,,,  oi  a  hypercube  of  order  n,  The  graph  has 
i  h<".c  <  li.ii  .h  tei  ist 1'  ■ 

•  There  .uc  '.''■  nodes.  This  humus  thai  the  number  of  nodes  (i.e.,  processors) 
grows  vei  y  quickly  wit  li  order. 

•  Every  vertex,  <\  in  //,,  lias  eovntricity  <(r)  n.  Kovnt ric  it v  is  the  distance 
to  .i  node  farthest  from  u,  Additionally,  each  node  in  a  hypercube  has  exactly 
one eccentri*  (farthest)  node.  This  property  means  ih.»i  hypercubes  an-  unique 
<•* » cut  i  i<  node  ( ii  c  u  )  graphs. 
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•  The  radius  of  a  graph  is  the  minimum  eccentricity  of  the  nodes  and  diameter  is 
the  maximum  eccentricity.  The  hypercube  is  self-centered,  meaning  its  radius 
and  diameter  are  the  same:  r(//n)  =  d(Hn)  =  n.  This  is  significant  because  it 
says  that  worst-case  communications  distances  only  grow  like  the  order  of  the 
hypercube. 

• Connectivity is a measure of reliability or fault tolerance in multiprocessor
networks. The connectivity of a hypercube is equal to the order of the cube, n.
The edge connectivity is also n (each node has n incident edges).
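The characteristics listed above can be checked by brute force on small cubes. The sketch below assumes the usual labeling in which each node carries an n-bit binary label and two nodes are adjacent exactly when their labels differ in one bit; the function names are illustrative and do not come from the thesis software.

```c
/* Number of nodes in a hypercube of order n (assumes n is small enough
 * that 2^n fits in an unsigned long). */
unsigned long cube_nodes(unsigned n)
{
    return 1UL << n;
}

/* Distance between nodes u and v: the number of label bits that differ. */
unsigned cube_distance(unsigned long u, unsigned long v)
{
    unsigned long diff = u ^ v;
    unsigned d = 0;

    while (diff) {
        d += (unsigned) (diff & 1UL);
        diff >>= 1;
    }
    return d;
}

/* The neighbor of v across dimension i (0 <= i < n) differs in bit i only. */
unsigned long cube_neighbor(unsigned long v, unsigned i)
{
    return v ^ (1UL << i);
}

/* The unique eccentric (farthest) node of v: complement all n label bits. */
unsigned long cube_eccentric(unsigned long v, unsigned n)
{
    return v ^ ((1UL << n) - 1UL);
}

/* Eccentricity of v, found by searching every node of the cube. */
unsigned cube_eccentricity(unsigned long v, unsigned n)
{
    unsigned long w;
    unsigned best = 0;

    for (w = 0; w < cube_nodes(n); w++)
        if (cube_distance(v, w) > best)
            best = cube_distance(v, w);
    return best;
}
```

For the 4-cube, cube_eccentricity(5, 4) returns 4 and cube_eccentric(5, 4) returns 10 (binary 0101 complemented to 1010), in agreement with e(v) = n and the single farthest node.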

Counting the number of nodes in a hypercube is easy. The hypercube construction
process also points to a recurrence relation that reveals the number of edges
in a hypercube. The initial case, of course, is the hypercube of order zero with no
edges. After this, the number of edges can be expressed in terms of the size of the
previous cube. Suppose a hypercube of order n has q edges. Then the hypercube of
order (n + 1) will have $2q + 2^n$ edges. This is because the construction procedure
calls for two copies (2q edges) and $2^n$ edges between them (one for each pair of
corresponding nodes).

Figure C.4 provides an example. This is the graph, $H_4$, of the hypercube of
order four. All of the characteristics given above are evident. Additionally, a Gray
code labeling of the nodes is given. The recurrence relation above is useful, but it
retains a dependence upon q. A more convenient formula would depend on n alone.

In fact, there is a simple formula for the number of edges in the graph of a
hypercube, but it requires a closer look at the recurrence relation. In more formal
terms: let q(n) represent the number of edges in a hypercube of order n. Then:

$$ q(n) = \begin{cases} 0 & \text{if } n = 0 \\ 2q(n-1) + 2^{n-1} & \text{if } n \ge 1 \end{cases} $$

This can be expanded and shown equivalent to: $q(n) = n\,2^{n-1}$. Table C.2
provides an example.
TABLE C.2: NODES AND EDGES FOR A HYPERCUBE

Order    Number of Nodes    Number of Edges
0        1                  0
1        2^1 = 2            2(0) + 2^0 = 1
2        2^2 = 4            2(1) + 2^1 = 4
3        2^3 = 8            2(4) + 2^2 = 12
4        2^4 = 16           2(12) + 2^3 = 32
5        2^5 = 32           2(32) + 2^4 = 80
6        2^6 = 64           2(80) + 2^5 = 192
7        2^7 = 128          2(192) + 2^6 = 448
(n-1)    2^(n-1)            q
n        2^n                2q + 2^(n-1)
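
The recurrence and the closed form can be cross-checked with a few lines of C. This is a minimal sketch with illustrative function names. The closed form also follows from a direct count: each of the 2^n nodes has n incident edges and every edge is shared by two nodes, so there are n 2^n / 2 edges.

```c
/* Edges in a hypercube of order n from the recurrence
 * q(0) = 0,  q(n) = 2 q(n-1) + 2^(n-1). */
unsigned long edges_recurrence(unsigned n)
{
    unsigned long q = 0;
    unsigned i;

    for (i = 1; i <= n; i++)
        q = 2UL * q + (1UL << (i - 1));
    return q;
}

/* Edges in a hypercube of order n from the closed form q(n) = n 2^(n-1). */
unsigned long edges_closed_form(unsigned n)
{
    return (n == 0) ? 0UL : ((unsigned long) n << (n - 1));
}
```

Both functions reproduce the last column of Table C.2; for example, each returns 448 for order seven.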
Figure C.4: Graph of a 4-Cube


H.   SOURCE  CODE  LISTINGS 


A  listing  of  the  Gray  code  generation  program  gray.c  follows. 
/*
 *  ==============    PROGRAM INFORMATION    ==============
 *
 *  SOURCE      gray.c
 *  VERSION     1.2
 *  DATE        01 August 1991
 *  AUTHOR      Jon Hartman, U. S. Naval Postgraduate School
 *  USAGE       gray
 *  REFERENCES  [1]  Hamming, Richard W.  "Coding and Information Theory",
 *                   2nd edition, Englewood Cliffs, N.J.: Prentice-Hall,
 *                   1986, pp. 97-99.
 */

/*
 *  ==============    DESCRIPTION    ==============
 *
 *  This program generates and displays the Gray code described in [1].
 *
 *  ==============    ALGORITHM    ==============
 *
 *  Consider a b-bit Gray code beginning at zero.  Let j be an integral index
 *  such that 0 <= j < b.  Consider two b-vectors, mod_counter[] and bin[].
 *  Each element, mod_counter[j], holds a count mod (2^(j+1)).  Initially we
 *  shall set mod_counter[j] = (2^j).  Furthermore, let the elements of bin[]
 *  represent a binary number in the natural way.  That is, each element,
 *  bin[j], will be either 0 or 1, and bin[] will be formed so that the sum,
 *  ( 2^0 * bin[0] + 2^1 * bin[1] + 2^2 * bin[2] + ... ), represents the
 *  'value' of bin[].  We have elected to start the code at zero, so let
 *  bin[] be set to zeros initially.  Next perform this algorithm:
 *
 *      for (i = 0; i < (2^b); i++) {
 *
 *          Print the "binary number" represented by bin[].
 *
 *          for (j = 0; j < b; j++) {
 *
 *              Let mod_counter[j] = (mod_counter[j] + 1) mod (2^(j+1))
 *
 *              If mod_counter[j] == 0, then toggle the bit in bin[j]
 *              (i.e., bin[j] = (bin[j] XOR 1) ).
 *
 *          }  end for(j)
 *
 *      }  end for(i)
 */
#include <stdio.h>
#include <stdlib.h>                 /* calloc(), free(), exit() */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE  1             /* value illegible in the scanned listing; 1 assumed */
#endif

#ifndef SUCCESS
#define SUCCESS  0                  /* value illegible in the scanned listing; 0 assumed */
#endif

#define POW2(n)  ((1L) << (n))      /* long constant, so the shift is done in a long */

int main(void) {

    int   patience = 5;     /*  there's a limit to my patience!   */
    long  b = 0,            /*  as in b-bit Gray code             */
          *bin,             /*  as described above                */
          i,
          j,                /*  generic integral values           */
          l,                /*  length of Gray code (2^b)         */
          *mod_counter;     /*  as described above                */

    printf("\n\n\n\n\n\n====  ");
    printf("This program generates the binary numbers of a Gray code.  ");
    printf("====\n\n\n");
    printf("     Successive numbers in a Gray code differ in exactly ");
    printf("one bit position.\n");
    printf("     The list generated by this program will be complete.  ");
    printf("That is, if you\n");
    printf("     request the code of numbers that are b-bits long, ");
    printf("you will get a list\n");
    printf("     of (2^b) binary numbers, starting with zero.\n\n\n");
    /*  The sole purpose of this while() loop is to get the value of b  */
    while (b <= 0) {

        int c;      /*  used to discard the rest of the input line  */

        printf("     Please enter desired length (binary digits):  ");
        if (scanf("%ld", &b) != 1)  b = 0;      /*  not a number: ask again  */

        /*  portable way to discard leftover input (fflush(stdin) is not)  */
        while ((c = getchar()) != '\n' && c != EOF)  ;
        printf("\n\n");

        if (b > 0) {        /*  else ask again (patience permitting)  */

            /*  guard against too many left shifts!  */
            if (b > (long) (sizeof(long)*8 - 2)) {

                printf("     The acceptable range is ");
                printf("1..%d.  ", (int) (sizeof(long)*8 - 2));
                printf("Please try again.\n\n\n");

                b = -1;
            }
            else  l = POW2(b);
        }

        /*  only failed attempts use up patience  */
        if ((b <= 0) && (--patience <= 0)) {

            printf("     Ran out of patience!\n");
            exit(EXIT_FAILURE);
        }
    }   /*  end while (b <= 0)  */


    /*  Allocate storage for the arrays, test to see if it worked  */
    bin         = (long*) calloc ((size_t) b, sizeof(long));
    mod_counter = (long*) calloc ((size_t) b, sizeof(long));

    if ((!bin) || (!mod_counter)) {

        printf("main():  Allocation failure bin[] or mod_counter[].\n");
        exit(EXIT_FAILURE);
    }

    /*  Initialize mod_counter[]  */
    for (i = 0; i < b; i++)  mod_counter[i] = POW2(i);

    printf("     Gray code for %ld bits will generate ", b);
    printf("%ld numbers.\n\n\n", l);
    printf("     Press RETURN to continue....");
    fflush(stdout);         /*  make sure the prompt is visible before waiting  */
    i = getc(stdin);
    printf("\n\n\n");
    /*  Do the for() loop spoken of in the "ALGORITHM" section above  */

    for (i = 0; i < l; i++) {

        /*  Print the binary representation held in bin[]  */
        printf("\t");

        for (j = (b-1); j >= 0; j--)  { printf("%ld", bin[j]); }

        printf("\n");

        /*  Adjust the counters using addition mod (2^(j+1)) and toggle the
         *  corresponding bit in bin[] whenever an element of mod_counter[]
         *  reaches zero.
         */
        for (j = 0; j < b; j++) {

            mod_counter[j]++;

            if ((mod_counter[j] %= POW2(j+1)) == 0)  bin[j] ^= 1;
        }
    }   /*  end for(i)  */

    free(bin);
    free(mod_counter);

    return(SUCCESS);
}

/* ============= EOF  gray.c ============= */
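
As a cross-check on the listing, the counter method can be compared with the well-known closed form for the binary-reflected Gray code, in which word i equals i XOR (i shifted right by one). The sketch below is not part of gray.c; the function names are illustrative, and gray_by_counters() is a simplified restatement of the ALGORITHM section that assumes b is at most 16.

```c
/* Word i of the binary-reflected Gray code, by the closed form. */
unsigned long gray_word(unsigned long i)
{
    return i ^ (i >> 1);
}

/* Number of bit positions in which a and b differ. */
unsigned bits_differing(unsigned long a, unsigned long b)
{
    unsigned long diff = a ^ b;
    unsigned count = 0;

    while (diff) {
        count += (unsigned) (diff & 1UL);
        diff >>= 1;
    }
    return count;
}

/* Word i of a b-bit code produced by the counter method described in the
 * ALGORITHM comment: counter[j] counts mod 2^(j+1), starts at 2^j, and
 * toggles bit j each time it wraps to zero.  Assumes 1 <= b <= 16. */
unsigned long gray_by_counters(unsigned long i, unsigned b)
{
    unsigned long counter[16];
    unsigned long word = 0, step;
    unsigned j;

    for (j = 0; j < b; j++)
        counter[j] = 1UL << j;

    for (step = 0; step < i; step++) {
        for (j = 0; j < b; j++) {
            counter[j] = (counter[j] + 1UL) % (1UL << (j + 1));
            if (counter[j] == 0)
                word ^= 1UL << j;
        }
    }
    return word;
}
```

For b = 4 the two functions agree on all sixteen words, which suggests that the program prints the binary-reflected code; bits_differing() confirms that successive words differ in exactly one position.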
APPENDIX D
A  SPARSE  MATRIX 

Partial differential equations can be used to characterize many physical problems.
Explicit solutions to these problems are often quite complicated, so alternative
approaches warrant our attention. Simple matrices exist as legitimate representatives
of complex problems. A system of linear equations can be constructed
to give a discrete approximation to the problem. The structure of the physical
setting guarantees that the corresponding matrix of coefficients will be sparse and
symmetric. Why does this happen? When do we have the right to expect such a
simple matrix? Where does the matrix come from and what does it mean?

This discussion explains how to construct the matrix of coefficients and vectors
that describe the numerical approximation to an elliptic partial differential
equation. Poisson's equation in two dimensions is used to demonstrate the process.
The first step uses a finite difference approximation to produce a system of equations.
The system is fine-tuned and the matrix of coefficients is extracted. The
process reveals the origins of structure and shows why the matrix is sparse and
symmetric.

A.   LAPLACE  AND  POISSON 

To most engineers, mathematicians, and scientists, Laplace and Poisson are
familiar French names. Pierre-Simon de Laplace (1749-1827) and Siméon Denis
Poisson (1781-1840) made sizeable contributions to several fields. In a moment, the
discussion turns to partial differential equations named in honor of these gentlemen.
If the material seems a bit difficult, the following quote from [Ref. 42: p. 10]
may provide some encouragement. The ideas are not so obvious to everyone as they
may have been to Laplace.

Nathaniel Bowditch (1773-1838), an American astronomer and mathematician,
while translating Laplace's Mécanique céleste in the early 1800s, stated, "I
never come across one of Laplace's 'Thus it plainly appears' without feeling sure
that I have hours of hard work before me to fill up the chasm and find out and show
how it plainly appears."
The next several pages are dedicated to showing how the matrix representation
of a partial differential equation plainly appears. The objective is to describe a
particular physical problem, then convert it to the equivalent matrix representation
using a deliberate, step-by-step approach.

B.   EQUATIONS 

Laplace and Poisson worked with partial differential equations that can be observed
in nature. What kinds of natural phenomena can be described with partial
differential equations? This section gives a brief answer to this question. The discussion
includes the natural setting, the equations, and a quick look at the variables
and constants involved. The link between the equations and their physical meaning
is critical, so this aspect must be developed. The heat equation has one of the most
intuitive physical interpretations available, so it is used as a starting point. After
developing a general perspective, the field can be narrowed to a particular example:
Poisson's equation. Such a limited survey of partial differential equations can only
hope to succeed by appealing to the reader's experience and intuition.

1.    Heat 

Before looking at a partial differential equation, let us recall some plane
geometry. The intersection of a plane and a cone provides many interesting shapes
and equations. Consider the equation that describes all points equidistant from a
point (focus) and a line (directrix):

$$ y = \left(\frac{1}{4c}\right) x^2 + k \qquad \text{(D.1)} $$

This is a parabola whose focus and vertex both lie on the y-axis (the axis of the
parabola is the y-axis). The focal length is c and the vertex is located at (0, k).
Partial differential equations are classified using conic sections much like
equations in the xy-plane. Introductions to partial differential equations often begin
with the heat equation:

$$ \frac{\partial u}{\partial t} = k \frac{\partial^2 u}{\partial x^2} + Q(x,t) \qquad \text{(D.2)} $$

This is an example of a parabolic partial differential equation. Note the similarity of
equations (D.1) and (D.2): each is first order in one variable and second order in the
other.

a.  Definitions  and  Notation 

The heat equation describes the temperature, u(x,t), in a "thin rod"
(the single dimension x appears in the equation). The presence of t indicates dependence
upon time. If there is a heat source (or sink) present, it is represented by Q.
We can see that Q may be a function of x or t or both. When mass density ($\rho$),
specific heat (s), and thermal conductivity (K) are known, the thermal diffusivity,
k, can be determined using the following relation:

$$ k = \frac{K}{s\rho} \qquad \text{(D.3)} $$

b.  Houses  and  Heat 

From  our  youth,  we  have  observed  several  important  properties  of  heat 
flow.  The  lessons  are  simple,  few  in  number,  and  can  be  observed  from  the  comfort 
of  our  home.  First,  heat  energy  only  flows  when  there  is  a  difference  in  temperature. 
If  the  temperature  outside  is  the  same  as  the  indoor  temperature,  no  heat  energy  will 
cross the threshold (even with the door open). A temperature difference represents
an  instability  and  heat  will  flow  to  counter  this  situation. 

When  heat  does  flow,  it  goes  from  hotter  to  colder  regions.  The  loss  of 
heat energy from the warmer region reduces the temperature there, and the temperature
in the colder region rises as it gains heat energy. The transfer of heat
has a stabilizing effect (the environment will not be at rest as long as temperature
differences  exist).  We  do  not  find  the  changes  in  temperature  surprising,  but  our 
conversation  indicates  confusion  concerning  the  direction  of  the  flow.  Most  of  us  have 
heard  someone  say:  "Close  the  door,  you're  letting  cold  air  in!".  We  understand  that 
this  statement  is  not  correct,  but  it  seems  to  persist  from  one  generation  to  the  next. 

In  addition  to  the  idea  that  heat  flows  in  the  presence  of  temperature 
differences  (gradients),  we  clearly  understand  that  larger  differences  are  related  to 
greater heat flow. On a very cold winter day, the parent notices more quickly that the
child  left  the  door  open  (and  displays  more  urgency  in  shutting  it).  In  other  words, 
the  effect  of  heat  flow  is  to  balance  differences  in  temperature  and  it  somehow  "works 
harder"  when  there  is  a  greater  difference  to  balance.  In  mathematical  terms,  we 
would  suspect  (correctly)  that  heat  flow  is  proportional  to  temperature  difference. 

Finally,  we  recognize  an  ability  to  restrict  heat's  ever-present  balancing 
efforts.  Sometimes  we  want  an  imbalance  in  temperature,  and  we  often  use  insulation 
to  maintain  this  imbalance.  When  we  shut  the  door,  we  expect  that  it  will  slow 
the  transfer  of  thermal  energy  through  the  doorway  and  enable  us  to  maintain  an 
acceptable  imbalance  in  temperature.  For  the  same  reason  we  use  special  materials 
in  the  construction  of  refrigerators  to  keep  heat  out,  and  in  ovens  to  keep  heat  energy 
inside.  This  means  that  the  effectiveness  of  heat  transfer  is  subject  to  properties  of 
the  medium  (air,  glass  windows,  fiberglass  insulation,  wood  doors,  steel,  styrofoam, 
and  so  on)  through  which  it  flows. 

c.    Heat  Flux 

The right-hand side of the heat equation looks a bit complex, but it
merely captures this idea of heat flow. Before tackling the second partial derivative
of u with respect to x, think about the first partial derivative. The first partial
derivative of u with respect to x (scaled by the thermal conductivity, K) describes
movement of thermal energy. This flow of heat is usually called heat flux, denoted
$\phi$, and can be calculated using Fourier's law of heat conduction:

$$ \phi = -K \frac{\partial u}{\partial x} \qquad \text{(D.4)} $$

Heat  flux  is  a  measure  of  how  much  thermal  energy  per  unit  time  is 
moving  to  the  right  per  unit  surface  area  (by  convention,  flow  to  the  left  is  assigned 
a  negative  value  and  flow  to  the  right  is  positive)  [Ref.  43:  p.  3].  The  second  partial 
derivative  measures  changes  in  flux  with  respect  to  position.  In  other  words,  it 
represents  increasing  or  decreasing  flux. 

d.    Heat  Equation  Summary 

Let us carefully reassemble the pieces of the heat equation (D.2) to see
if the theory agrees with experience. Temperature has spatial and temporal dependencies.
The left-hand side describes changes in temperature over time. Changes in
heat flux are captured in the second partial of u that appears on the right-hand side.
Flux, heat energy in motion, acts to equalize temperature. The thermal diffusivity,
k, measures how readily heat diffuses through the material. That is, a temperature
difference activates the flow of heat but the speed and effectiveness of this flow is
moderated by material properties. Considering everything, then, the heat equation
can be stated in one (long) sentence: Changes in temperature over time are caused by
(equal to, due to, related to) changes in heat flow (moderated or accelerated by
properties of the material) and thermal source(s).
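
The one-sentence summary can be watched in action with a small numerical experiment. The sketch below is an illustration only and is not taken from the thesis: it assumes no source (Q = 0), a rod sampled at NPTS equally spaced points with both ends held at fixed temperature, and a simple explicit time step in which the second partial of u is replaced by a centered difference. The parameter r stands for k dt / dx^2 and should not exceed 0.5 for this scheme to remain stable.

```c
#define NPTS 11     /* number of sample points along the rod (assumed) */

/* Advance the temperatures u[] by one time step of u_t = k u_xx.
 * r = k * dt / (dx * dx); keep r <= 0.5 for stability.
 * The two end points are boundary values and are copied unchanged. */
void heat_step(const double u[NPTS], double next[NPTS], double r)
{
    int i;

    next[0]        = u[0];
    next[NPTS - 1] = u[NPTS - 1];

    for (i = 1; i < NPTS - 1; i++)
        next[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
}
```

Starting from a rod at temperature 0 with a single hot point of 100 at index 5, one step with r = 0.25 lowers the hot point to 50 and raises each neighbor to 25: heat moves from the hotter region to the colder ones, and a uniform temperature is left unchanged.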

2.    Notation 

With two or more dimensions, the same equations that looked simple in one
dimension can begin to look complex. The linear operator, $\Delta$, is used to simplify
the notation. For example, $\Delta u$, substituted into the right-hand side of (D.2), gives
the heat equation a new look:

$$ \frac{\partial u}{\partial t} = k \Delta u + Q(x,t) \qquad \text{(D.5)} $$

This is a more general equation since the linear operator $\Delta$ can be applied in any
number of dimensions. For instance (in three dimensions),

$$ \Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} \qquad \text{(D.6)} $$

Sometimes this operator is called the Laplacian of u and some authors use the del
operator, $\nabla$, in these equations ($\nabla^2 u = \Delta u$).

3.  Diffusion 

The  behavior  of  thermal  energy  is  actually  a  special  instance  of  diffusion, 
so (D.5) is often referred to as the diffusion equation. With an appropriate substitution
for k, the equation might describe the spreading of dye through ocean water.
In  an  agricultural  application,  it  could  characterize  water  or  chemical  penetration 
in  soil.  We  shall  continue  to  use  the  term  "heat  equation",  though,  for  the  sake  of 
consistent  terminology  and  notation. 

4.  Laplace's  Equation 

Consider the effect of a few restrictions on the heat equation. Suppose that
there is no source of thermal energy (Q = 0) and the physical properties of the
material do not vary (k is constant). Finally, what happens if the time-dependency
is removed?

The left-hand side of the equation goes away. This is not so unrealistic.
Systems may reach a steady (equilibrium) state after a time (especially in the absence
of sources). We can divide through by k (assuming $k \ne 0$) and the equation becomes:

$$ \Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \qquad \text{(D.7)} $$
This is Laplace's equation in the two dimensions x and y. Sometimes it is called
the potential equation since it also describes the cases in which u stands for
gravity or voltage. It can also describe "steady-state heat flow ... hydrodynamics,
gravitational attraction, elasticity, and certain motions of incompressible fluids"
[Ref. 44: pp. 660-661].

5.   Ellipses 

Although Laplace's equation seems like a steady-state heat equation, it is
fundamentally different. It falls in the elliptic class of partial differential equations.
Consider an ellipse centered at the origin with foci (on the x-axis at a distance of c
from the origin) located at (-c, 0) and (c, 0). Suppose that the foci are labeled $F_1$
and $F_2$. The major axis passes through the center and through the foci, connecting
two vertices positioned at (-a, 0) and (a, 0). The minor axis passes through the
center perpendicular to the major axis and connects the vertices at (0, -b) and
(0, b). The major axis deserves its name since a > b (in the case of equality the
ellipse degenerates and we get a special case, the circle).

For any arbitrary point, p, let $d_1$ be the distance from p to $F_1$
and let $d_2$ be the distance from p to $F_2$. Furthermore, let $d = d_1 + d_2$. The ellipse
is described by all points satisfying d = 2a, where a is the constant length of the
ellipse's semi-major axis as described above. The standard form for the equation of
this ellipse is

$$ \frac{x^2}{a^2} + \frac{y^2}{b^2} = 1 \qquad \text{(D.8)} $$

Using the distances from this ellipse, a right triangle can be formed with sides of
length b and c and hypotenuse of length a. This means a, b, and c are related by the
Pythagorean Theorem.
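
The right triangle can be made explicit with a short worked step. Place the point p at the end of the minor axis, (0, b), which is an assumption made only for convenience. Both foci are then the same distance from p, and the defining condition d = 2a gives the relation directly:

```latex
\[
  d_1 = d_2 = \sqrt{b^2 + c^2}
  \quad\Longrightarrow\quad
  d = d_1 + d_2 = 2\sqrt{b^2 + c^2} = 2a
  \quad\Longrightarrow\quad
  a^2 = b^2 + c^2 .
\]
```

The triangle has its vertices at the center, at one focus, and at (0, b), so its legs have lengths c and b and its hypotenuse has length a.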


Figure D.1: The Region (the rectangle $0 \le x \le L$, $0 \le y \le H$, with
$\Delta u = f(x,y)$ in the interior and $u = g(x,y)$ on each of the four sides)

6.    Poisson's  Equation 

We have discussed several partial differential equations and observed the
impact of changing a few parameters. Laplace's equation showed what happens in
the steady-state case when sources are removed and the thermal diffusivity is non-zero.
Now we return to the more general problem that can be represented in the
presence of a source, sometimes called a driving (or forcing) function, say f(x,y).

The result is Poisson's equation (shown here in two dimensions):

$$ \Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y) \qquad \text{(D.9)} $$

Again, u(x,y) typically represents temperature or voltage. Laplace's equation (D.7)
is just the special case of Poisson's equation (D.9) where f(x,y) = 0. The rest of
the discussion will focus on Poisson's equation within the rectangular region (shown
in Figure D.1): $0 \le x \le L$, $0 \le y \le H$.
Figure D.2: Subdividing the Rectangle

7.    Final  Assumptions 

We shall assume that the conditions along the boundaries are known and are
given by u = g(x,y). The problem is solved in the presence of a forcing function f.
The goal is to produce something that a computing machine can "solve". To reach
this position, several steps are required. First, the domain is divided into many
smaller regions. Using this subdivision scheme, a system of equations is developed.
The information that is known (f and g) can be moved to the right-hand side of the
system. The system can then be represented in typical Ax = b fashion.

C.   DISCRETIZATION 

Before  attempting  a  numerical  solution,  the  domain  must  be  subdivided  into  a 
finite  (but  probably  large)  number  of  elements.  Figure  D.2  provides  an  illustration 
of  what  this  mesh  looks  like.  We  should  not  forget  that  actual  applications  may 
involve 100 (or more) divisions in each direction. Nevertheless, (artificially) small
examples are quite sufficient for conveying notation and measures within the region.

1.  Notation 

A  clear  understanding  of  the  problem  domain,  conventions,  and  notation 
is  prerequisite  to  developing  the  system  of  equations.  Consider  Figure  D.2.  This 
domain  will  serve  as  a  reference  for  the  upcoming  discussion  on  conventions  and 
notation. 

The rectangular region has length L = 9 and height H = 5. It has been
subdivided into 45 smaller elements by a mesh made of four horizontal lines and eight
vertical lines. The integers m and n are used to keep track of how many horizontal
and vertical dividing lines are used (here m = 4 and n = 8). Each element has length
h (in the x-direction) and height k (in the y-direction). In this particular example,
the elements are (conveniently) square with h = k = 1. In general, the individual
elements within the region are rectangular (it is not necessarily true that h = k).

The elements within the region are uniformly spaced (each has the same
size). L, H, h, and k do not need to be integers; they can be any convenient units.
To guarantee uniform spacing, of course, L and H must be integer multiples of h
and k, respectively. That is:

$$ L = (n + 1)h, \qquad n \in \{0, 1, 2, 3, \ldots\} $$
$$ H = (m + 1)k, \qquad m \in \{0, 1, 2, 3, \ldots\} $$

2.  Internal  Mesh  Points 

Our goal is a system of equations, and ultimately a problem stated in terms
of a matrix and vectors. We will eventually see that there are mn equations in mn
unknowns, one for each internal mesh point (where the lines cross). Imagine elements
of size h x k (as before) that are centered on these points, such as the cross-hatched
element at (7,3). Each equation in the system will correspond to one of these
line-crossings and represent one of these elements. It is useful to label the lines for
reference purposes. To accomplish this, we use the (integer) counters i and j.

These counters are used to reference particular vertical and horizontal dividing
lines. The i counter refers to a vertical line ($1 \le i \le n$) and the horizontal
lines are indexed by j ($1 \le j \le m$). Figure D.2 may be deceptively simple due to
the element dimensions h = k = 1. Because of this, i = 7 indicates an x-coordinate
of 7 and j = 3 means y = 3. But the counters i and j are not generally equivalent to
x- and y-position in the coordinate system. Given h, k, i, and j the corresponding
coordinates are (x,y) = (ih, jk).
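
The mesh conventions translate directly into a few helper functions. This is a minimal sketch; the function names are illustrative and are not taken from the thesis software.

```c
/* Element length h from L = (n + 1) h, where n counts the vertical
 * dividing lines. */
double mesh_h(double L, int n)
{
    return L / (n + 1);
}

/* Element height k from H = (m + 1) k, where m counts the horizontal
 * dividing lines. */
double mesh_k(double H, int m)
{
    return H / (m + 1);
}

/* x-coordinate of vertical line i and y-coordinate of horizontal line j. */
double mesh_x(int i, double h) { return i * h; }
double mesh_y(int j, double k) { return j * k; }

/* Nonzero when (i, j) is an internal mesh point, that is, when
 * 1 <= i <= n and 1 <= j <= m. */
int is_internal(int i, int j, int n, int m)
{
    return (i >= 1) && (i <= n) && (j >= 1) && (j <= m);
}
```

For the region of Figure D.2 (L = 9, H = 5, n = 8, m = 4) both mesh_h() and mesh_k() return 1, and the point with i = 7 and j = 3 lies at (7, 3). Indices 0, n + 1, and m + 1 fall on the boundary and are rejected by is_internal().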

D.  A  SYSTEM  OF  EQUATIONS 

The next step is to build a system of mn equations that describes the problem.
First, we need to agree upon a referencing scheme for the internal mesh points. The
numbering will be based upon i and j as defined above. This numbering scheme
begins at the bottom left (i.e., i = j = 1), proceeds up the first column and then
moves, column-by-column, to the right. Specifically, the points will be assigned a
label

$$ \ell = m(i - 1) + j \qquad \text{(D.10)} $$

Given the values i and j for any internal point, now we can assign it a label
($1 \le \ell \le mn$). Figure D.3 shows values of i along the x-axis, values of j
along the y-axis, and labeling of internal mesh points according to (D.10).
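
The labeling rule (D.10) and its inverse are short enough to state in code. This is a minimal sketch with illustrative names; the inverse mapping is not given in the text and is included here only as a convenience.

```c
/* Label of internal mesh point (i, j) according to (D.10):
 * label = m (i - 1) + j, with 1 <= i <= n and 1 <= j <= m. */
int mesh_label(int i, int j, int m)
{
    return m * (i - 1) + j;
}

/* Inverse of (D.10): recover i and j from a label in 1..mn. */
void mesh_point(int label, int m, int *i, int *j)
{
    *i = (label - 1) / m + 1;
    *j = (label - 1) % m + 1;
}
```

With m = 4 and n = 8, the bottom-left point (1, 1) receives label 1, the point (7, 3) receives label 27, and the top-right point (8, 4) receives label mn = 32.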

1.   Finite  Differences 

Figure D.3: Numbering the Equations

The approach calls for analyzing each internal mesh point. Figure D.4
shows the point referenced by i and j and its neighbors to the North, South, East,
and West. We use a centered finite difference method to approximate the partial
derivatives in (D.9) and arrive at the equations for these points. The finite difference
approximations for the partial derivatives are:

$$ \frac{\partial^2 u}{\partial x^2}(i,j) \approx \frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2} \qquad \text{(D.11)} $$

$$ \frac{\partial^2 u}{\partial y^2}(i,j) \approx \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{k^2} \qquad \text{(D.12)} $$

The approximation for the partial derivative in the x-direction (D.11) considers
the neighbor to the West, the point itself, and the neighbor to the East.
Similarly, the approximation in the y-direction (D.12) recognizes neighbors to the
South and North in addition to the point. Both finite difference approximations
favor the center point (i, j), giving it twice the weight of its neighbors.
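
The two approximations can be tried on a function whose derivatives are known. The sketch below is an illustration under stated assumptions: the test function u(x,y) = x^2 + y^2 is hypothetical and was chosen because its Laplacian is exactly 4 everywhere and because centered differences are exact for quadratics.

```c
/* Centered approximation (D.11) to the second partial in x, built from
 * the West neighbor, the point itself, and the East neighbor. */
double d2_dx2(double west, double center, double east, double h)
{
    return (west - 2.0 * center + east) / (h * h);
}

/* Centered approximation (D.12) to the second partial in y, built from
 * the South neighbor, the point itself, and the North neighbor. */
double d2_dy2(double south, double center, double north, double k)
{
    return (south - 2.0 * center + north) / (k * k);
}

/* Hypothetical test function; its exact Laplacian is 4. */
double sample_u(double x, double y)
{
    return x * x + y * y;
}

/* Approximate Laplacian of sample_u at (x, y) with spacings h and k. */
double laplacian_at(double x, double y, double h, double k)
{
    double center = sample_u(x, y);

    return d2_dx2(sample_u(x - h, y), center, sample_u(x + h, y), h)
         + d2_dy2(sample_u(x, y - k), center, sample_u(x, y + k), k);
}
```

At the point (2, 3) with h = 1 and k = 0.5, laplacian_at() returns 4, matching the exact value. For functions that are not quadratic the result is only an approximation whose error shrinks as h and k are reduced.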

Substituting these into Poisson's equation (D.9) yields:

$$ -\frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2} - \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{k^2} \approx -\Delta u_{i,j} = -f_{i,j} \qquad \text{(D.13)} $$
Figure D.4: Neighbors to the North, South, East, and West

The forcing function, $f_{i,j}$, is known so (D.13) begins to look like one of many equations
in a linear system. There is such an equation for every internal mesh point.
To make sure that we consider all of the internal mesh points in an orderly fashion,
we may number them as in Figure D.3 and consider them one at a time.

2.    More  Equations 


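At this point, we know the general form (D.13) for each of the equations
that must be considered. The matrix of coefficients may not be completely clear yet,
so let us consider each of the equations in the order of their labels. For now, we will
leave the i,j subscripts on everything:

$$
\begin{aligned}
-\frac{u_{0,1} - 2u_{1,1} + u_{2,1}}{h^2} - \frac{u_{1,0} - 2u_{1,1} + u_{1,2}}{k^2} &\approx -f_{1,1} \\
-\frac{u_{0,2} - 2u_{1,2} + u_{2,2}}{h^2} - \frac{u_{1,1} - 2u_{1,2} + u_{1,3}}{k^2} &\approx -f_{1,2} \\
&\;\;\vdots \\
-\frac{u_{0,m-1} - 2u_{1,m-1} + u_{2,m-1}}{h^2} - \frac{u_{1,m-2} - 2u_{1,m-1} + u_{1,m}}{k^2} &\approx -f_{1,m-1} \\
-\frac{u_{0,m} - 2u_{1,m} + u_{2,m}}{h^2} - \frac{u_{1,m-1} - 2u_{1,m} + u_{1,m+1}}{k^2} &\approx -f_{1,m} \\
-\frac{u_{1,1} - 2u_{2,1} + u_{3,1}}{h^2} - \frac{u_{2,0} - 2u_{2,1} + u_{2,2}}{k^2} &\approx -f_{2,1} \\
-\frac{u_{1,2} - 2u_{2,2} + u_{3,2}}{h^2} - \frac{u_{2,1} - 2u_{2,2} + u_{2,3}}{k^2} &\approx -f_{2,2} \\
&\;\;\vdots \\
-\frac{u_{1,m-1} - 2u_{2,m-1} + u_{3,m-1}}{h^2} - \frac{u_{2,m-2} - 2u_{2,m-1} + u_{2,m}}{k^2} &\approx -f_{2,m-1} \\
-\frac{u_{1,m} - 2u_{2,m} + u_{3,m}}{h^2} - \frac{u_{2,m-1} - 2u_{2,m} + u_{2,m+1}}{k^2} &\approx -f_{2,m} \\
&\;\;\vdots \\
-\frac{u_{n-2,1} - 2u_{n-1,1} + u_{n,1}}{h^2} - \frac{u_{n-1,0} - 2u_{n-1,1} + u_{n-1,2}}{k^2} &\approx -f_{n-1,1} \\
-\frac{u_{n-2,2} - 2u_{n-1,2} + u_{n,2}}{h^2} - \frac{u_{n-1,1} - 2u_{n-1,2} + u_{n-1,3}}{k^2} &\approx -f_{n-1,2} \\
&\;\;\vdots \\
-\frac{u_{n-2,m-1} - 2u_{n-1,m-1} + u_{n,m-1}}{h^2} - \frac{u_{n-1,m-2} - 2u_{n-1,m-1} + u_{n-1,m}}{k^2} &\approx -f_{n-1,m-1} \\
-\frac{u_{n-2,m} - 2u_{n-1,m} + u_{n,m}}{h^2} - \frac{u_{n-1,m-1} - 2u_{n-1,m} + u_{n-1,m+1}}{k^2} &\approx -f_{n-1,m} \\
-\frac{u_{n-1,1} - 2u_{n,1} + u_{n+1,1}}{h^2} - \frac{u_{n,0} - 2u_{n,1} + u_{n,2}}{k^2} &\approx -f_{n,1} \\
-\frac{u_{n-1,2} - 2u_{n,2} + u_{n+1,2}}{h^2} - \frac{u_{n,1} - 2u_{n,2} + u_{n,3}}{k^2} &\approx -f_{n,2} \\
&\;\;\vdots \\
-\frac{u_{n-1,m-1} - 2u_{n,m-1} + u_{n+1,m-1}}{h^2} - \frac{u_{n,m-2} - 2u_{n,m-1} + u_{n,m}}{k^2} &\approx -f_{n,m-1} \\
-\frac{u_{n-1,m} - 2u_{n,m} + u_{n+1,m}}{h^2} - \frac{u_{n,m-1} - 2u_{n,m} + u_{n,m+1}}{k^2} &\approx -f_{n,m}
\end{aligned}
$$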

3.    Modification 

The goal is to determine $u_{i,j}$ for all internal points (i, j). Having completed
several foundational steps, we can see a developing system of mn equations. Let's
clean it up a bit. To do this, we need to make better use of one more piece of the
given information: the boundary values. For those points just inside the boundaries
(a horizontal distance of h from the sides and/or a vertical distance of k from the
top or bottom) we already know part of the left side of (D.13). In particular, any
subscript i = 0, j = 0, i = n + 1, and/or j = m + 1 signifies a (known) boundary
point.

Multiplying through by $(hk)^2$ and moving the known information to the
right-hand side of the equations, we again start with the left-most column ($i = 1$)
and work in the order of the labels. Now the system of equations looks like this:

\[
\begin{aligned}
k^2(2u_{1,1} - u_{2,1}) + h^2(2u_{1,1} - u_{1,2}) &\approx -(hk)^2 f_{1,1} + k^2 u_{0,1} + h^2 u_{1,0} \\
k^2(2u_{1,2} - u_{2,2}) + h^2(-u_{1,1} + 2u_{1,2} - u_{1,3}) &\approx -(hk)^2 f_{1,2} + k^2 u_{0,2} \\
&\;\;\vdots \\
k^2(2u_{1,m-1} - u_{2,m-1}) + h^2(-u_{1,m-2} + 2u_{1,m-1} - u_{1,m}) &\approx -(hk)^2 f_{1,m-1} + k^2 u_{0,m-1} \\
k^2(2u_{1,m} - u_{2,m}) + h^2(-u_{1,m-1} + 2u_{1,m}) &\approx -(hk)^2 f_{1,m} + k^2 u_{0,m} + h^2 u_{1,m+1} \\
k^2(-u_{1,1} + 2u_{2,1} - u_{3,1}) + h^2(2u_{2,1} - u_{2,2}) &\approx -(hk)^2 f_{2,1} + h^2 u_{2,0} \\
k^2(-u_{1,2} + 2u_{2,2} - u_{3,2}) + h^2(-u_{2,1} + 2u_{2,2} - u_{2,3}) &\approx -(hk)^2 f_{2,2} \\
&\;\;\vdots \\
k^2(-u_{1,m-1} + 2u_{2,m-1} - u_{3,m-1}) + h^2(-u_{2,m-2} + 2u_{2,m-1} - u_{2,m}) &\approx -(hk)^2 f_{2,m-1} \\
k^2(-u_{1,m} + 2u_{2,m} - u_{3,m}) + h^2(-u_{2,m-1} + 2u_{2,m}) &\approx -(hk)^2 f_{2,m} + h^2 u_{2,m+1} \\
&\;\;\vdots \\
k^2(-u_{n-2,1} + 2u_{n-1,1} - u_{n,1}) + h^2(2u_{n-1,1} - u_{n-1,2}) &\approx -(hk)^2 f_{n-1,1} + h^2 u_{n-1,0} \\
k^2(-u_{n-2,2} + 2u_{n-1,2} - u_{n,2}) + h^2(-u_{n-1,1} + 2u_{n-1,2} - u_{n-1,3}) &\approx -(hk)^2 f_{n-1,2} \\
&\;\;\vdots \\
k^2(-u_{n-2,m-1} + 2u_{n-1,m-1} - u_{n,m-1}) + h^2(-u_{n-1,m-2} + 2u_{n-1,m-1} - u_{n-1,m}) &\approx -(hk)^2 f_{n-1,m-1} \\
k^2(-u_{n-2,m} + 2u_{n-1,m} - u_{n,m}) + h^2(-u_{n-1,m-1} + 2u_{n-1,m}) &\approx -(hk)^2 f_{n-1,m} + h^2 u_{n-1,m+1} \\
k^2(-u_{n-1,1} + 2u_{n,1}) + h^2(2u_{n,1} - u_{n,2}) &\approx -(hk)^2 f_{n,1} + k^2 u_{n+1,1} + h^2 u_{n,0} \\
k^2(-u_{n-1,2} + 2u_{n,2}) + h^2(-u_{n,1} + 2u_{n,2} - u_{n,3}) &\approx -(hk)^2 f_{n,2} + k^2 u_{n+1,2} \\
&\;\;\vdots \\
k^2(-u_{n-1,m-1} + 2u_{n,m-1}) + h^2(-u_{n,m-2} + 2u_{n,m-1} - u_{n,m}) &\approx -(hk)^2 f_{n,m-1} + k^2 u_{n+1,m-1} \\
k^2(-u_{n-1,m} + 2u_{n,m}) + h^2(-u_{n,m-1} + 2u_{n,m}) &\approx -(hk)^2 f_{n,m} + k^2 u_{n+1,m} + h^2 u_{n,m+1}
\end{aligned}
\]

Now the equations are very close to what we want. There are some unfortunate
side effects to such a deliberate approach. The list of equations is tedious,
the  subscripts  are  a  bit  involved,  and  it  takes  some  concentration  to  match  things 
up.  There  are  some  benefits,  though,  for  those  who  can  endure!  It  will  take  very 
little  effort  to  see  how  the  coefficients  are  collected. 

E.   MATRIX  REPRESENTATION 

It is not hard to translate the preceding equations into the familiar representation
$Ax = b$. Notation is quite important. We will start with the obvious, exchanging
$u$ for $x$ so that (eventually) the system will look like $Au = b$. Dimensions are
important too. The goal is a large, sparse, symmetric matrix $A \in \mathbb{R}^{mn \times mn}$.
The vectors $u$ and $b$ have the obvious dimensions and are assumed to contain real
numbers as well.

1.    Unknowns

Since there is a great deal of structure in this problem, it is useful to
partition the vector of unknowns, $u$. Let $u_{i,j}$ have the same meaning as it did
in equation (D.13) and consider the $m$-vector:

\[
u_i = \begin{bmatrix} u_{i,1} \\ u_{i,2} \\ \vdots \\ u_{i,m-1} \\ u_{i,m} \end{bmatrix}
\]

This vector captures all of the unknowns for a given column, $i$, of the original region.
Now we can stack the columns, $n$ in number, forming the entire vector $u$ of unknowns:

\[
u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}
\]

This process has clearly formed $u \in \mathbb{R}^{mn}$. Now we turn to the matrix of coefficients.

2.    Coefficients 

The matrix $A$ is formed by combining two smaller matrices, $T$ and $D$. First
we shall consider the tridiagonal matrix $T \in \mathbb{R}^{m \times m}$. For aesthetic purposes only, let
the diagonal elements of $T$ be $d = 2(h^2 + k^2)$.

\[
T = \begin{bmatrix}
d & -h^2 & & & \\
-h^2 & d & -h^2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -h^2 & d & -h^2 \\
 & & & -h^2 & d
\end{bmatrix}
\]

Next, consider the diagonal matrix $D \in \mathbb{R}^{m \times m}$:

\[
D = \begin{bmatrix}
-k^2 & & & \\
 & -k^2 & & \\
 & & \ddots & \\
 & & & -k^2
\end{bmatrix}
\]

Forming the matrix $A$ requires $n$ identical copies of $T$ and $2(n - 1)$ identical
copies of $D$. The matrices in $A$ below are assigned subscripts for counting purposes.
The matrix subscripts, by the way, denote a value of $i$ corresponding to the partition
$u_i$ which the matrix will multiply. $A$ is the block-tridiagonal matrix

\[
A = \begin{bmatrix}
T_1 & D_2 & & & & \\
D_1 & T_2 & D_3 & & & \\
 & D_2 & T_3 & D_4 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & D_{n-2} & T_{n-1} & D_n \\
 & & & & D_{n-1} & T_n
\end{bmatrix}
\]
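
Because every block of $A$ is either $T$ or $D$, the product $Au$ can be formed without ever storing $A$. The following is a minimal sketch of that idea, not code from this thesis: the names `apply_A` and `idx` are illustrative, and the storage order assumes the unknowns are stacked column by column exactly as in the partition of $u$ above.

```c
#include <stddef.h>

/* Position of u_{i,j} (1 <= i <= n, 1 <= j <= m) in the stacked vector
 * u = (u_1, ..., u_n), where u_i holds the m unknowns of mesh column i. */
static size_t idx(size_t i, size_t j, size_t m)
{
    return (i - 1) * m + (j - 1);
}

/* Compute out = A u without storing A.
 * Block row i of A is  D u_{i-1} + T u_i + D u_{i+1},  where
 *   T has diagonal d = 2(h^2 + k^2) and off-diagonals -h^2,
 *   D has diagonal -k^2.
 * Terms that would reach outside 1..n or 1..m are simply omitted,
 * because the boundary values were moved to the right-hand side. */
void apply_A(size_t n, size_t m, double h, double k,
             const double *u, double *out)
{
    const double h2 = h * h;
    const double k2 = k * k;
    const double d = 2.0 * (h2 + k2);

    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            double s = d * u[idx(i, j, m)];
            if (j > 1) s -= h2 * u[idx(i, j - 1, m)];  /* T, sub-diagonal   */
            if (j < m) s -= h2 * u[idx(i, j + 1, m)];  /* T, super-diagonal */
            if (i > 1) s -= k2 * u[idx(i - 1, j, m)];  /* D, left block     */
            if (i < n) s -= k2 * u[idx(i + 1, j, m)];  /* D, right block    */
            out[idx(i, j, m)] = s;
        }
    }
}
```

A matrix-free product of this kind is what an iterative solver would call repeatedly; a direct solver would instead need the blocks stored explicitly.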


3.    Knowns 


We could proceed immediately to the right-hand-side vector, $b \in \mathbb{R}^{mn}$, using the
equations provided in the previous section. Again, though, the result can be cleaned
up a bit if we form $b$ as the sum of three vectors $f$, $v$, $w$.

The vector $f \in \mathbb{R}^{mn}$ represents the forcing function. The equations clearly
indicate where the scalar multiplier comes from.

\[
f = -(hk)^2 \begin{bmatrix} f_{1,1} \\ f_{1,2} \\ \vdots \\ f_{1,m-1} \\ f_{1,m} \\ f_{2,1} \\ f_{2,2} \\ \vdots \\ f_{n,m-1} \\ f_{n,m} \end{bmatrix}
\]

Next, the vector $v \in \mathbb{R}^{mn}$ is used to represent the information that is known
due to the boundary values on the East and West sides of the region.

\[
v = k^2 \begin{bmatrix} u_{0,1} \\ u_{0,2} \\ \vdots \\ u_{0,m-1} \\ u_{0,m} \\ 0 \\ \vdots \\ 0 \\ u_{n+1,1} \\ u_{n+1,2} \\ \vdots \\ u_{n+1,m-1} \\ u_{n+1,m} \end{bmatrix}
\]

Finally, the vector $w \in \mathbb{R}^{mn}$ is used to represent the information that is
known due to the boundary values on the North and South sides of the region.

\[
w = h^2 \begin{bmatrix} u_{1,0} \\ 0 \\ \vdots \\ 0 \\ u_{1,m+1} \\ u_{2,0} \\ 0 \\ \vdots \\ 0 \\ u_{2,m+1} \\ u_{3,0} \\ \vdots \\ u_{n,m+1} \end{bmatrix}
\]

Now $b$ is a simple sum of these vectors: $b = f + v + w$.
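
The three contributions to the right-hand side can also be accumulated in a single pass over the mesh. The sketch below is only an illustration under stated assumptions: the function name `build_rhs` and the array arguments `west`, `east`, `south`, and `north` are hypothetical, with `west` and `east` holding the known values $u_{0,j}$ and $u_{n+1,j}$, and `south` and `north` holding $u_{i,0}$ and $u_{i,m+1}$.

```c
#include <stddef.h>

/* Assemble b = f + v + w for the stacked ordering (i - 1) * m + (j - 1).
 *   f     : forcing values f_{i,j}, stacked like u, length m*n
 *   west  : u_{0,j},    j = 1..m   (assumed name)
 *   east  : u_{n+1,j},  j = 1..m   (assumed name)
 *   south : u_{i,0},    i = 1..n   (assumed name)
 *   north : u_{i,m+1},  i = 1..n   (assumed name)                       */
void build_rhs(size_t n, size_t m, double h, double k,
               const double *f,
               const double *west, const double *east,
               const double *south, const double *north,
               double *b)
{
    const double h2 = h * h;
    const double k2 = k * k;

    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            const size_t p = (i - 1) * m + (j - 1);
            double s = -(h2 * k2) * f[p];          /* f: -(hk)^2 f_{i,j}  */
            if (i == 1) s += k2 * west[j - 1];     /* v: k^2 u_{0,j}      */
            if (i == n) s += k2 * east[j - 1];     /* v: k^2 u_{n+1,j}    */
            if (j == 1) s += h2 * south[i - 1];    /* w: h^2 u_{i,0}      */
            if (j == m) s += h2 * north[i - 1];    /* w: h^2 u_{i,m+1}    */
            b[p] = s;
        }
    }
}
```

Interior points receive only the forcing term, which matches the zero entries in $v$ and $w$.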
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F.   CONCLUSION 

This appendix has shown a few examples of partial differential equations that
appear frequently in nature. Poisson's equation in two dimensions was selected as
an example. After the finite difference approximation is selected, determining the
system of equations is a tedious (but not too complicated) process. Once the system
of equations is written down, the matrix representation is easy to come by.

APPENDIX E
HYPERCUBE  COMMUNICATIONS 

This  report  displays  the  results  of  point-to-point  communications  tests  that 
were  performed  on  the  Intel  iPSC/2  hypercube.  The  emphasis  of  the  experiment 
was  to  evaluate  several  aspects  of  communications  time.  The  exercise  showed  that 
communication  on  this  machine  is  virtually  independent  of  the  Hamming  distance 
between  communicating  nodes.  There  is  clear  evidence  that  transmission  rates  are 
related  to  message  length  (the  transmission  system  favors  longer  messages)  due — at 
least  in  part — to  an  overhead  charged  to  begin  the  communication.  Communications 
between  the  host  and  a  node  never  achieve  the  rate  that  can  be  realized  with  node- 
to-node  transmissions. 
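
For readers unfamiliar with the term, the Hamming distance between two hypercube node numbers is the number of bit positions in which they differ, which is also the minimum number of links a message between them must cross. A small sketch (the function name is illustrative and does not appear in the test programs):

```c
/* Hamming distance between two hypercube node numbers: the count of
 * bit positions in which a and b differ. */
unsigned hamming_distance(unsigned a, unsigned b)
{
    unsigned x = a ^ b;   /* set bits mark the differing positions */
    unsigned count = 0;

    while (x != 0) {
        x &= x - 1;       /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

In a three-dimensional cube, for example, nodes 0 and 7 are three links apart, while nodes 2 and 3 are direct neighbors.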

The communications test code described in this appendix was only executed on
the iPSC/2. Time did not permit modification of the code and testing on the
transputer networks. A thorough test of communications and computational abilities of
the T414 and T800 transputers has already been performed by Gregory Bryant. His
master's thesis [Ref. 26] contains the documentation of this work. A short summary
of Bryant's findings is included in the conclusions to this appendix.

A.   SOURCE  CODE  OVERVIEW 

The host program (commtst.c) and a node program (commtstn.c) contain
most of the code for this experiment. There is also a header file, commtst.h, shared
by these codes. Finally (but perhaps most important for any high-level survey of the
code), the makefile commtst.mak shows dependencies and compilation procedures.
In the discussion that follows, bold-faced type is used to indicate function and object
names that actually appear in the code.

B.   STRATEGY 

The  program  must  define  the  valid  arguments.  The  function  interpret_args() 
takes  care  of  checking  for  occurrences  of  these  arguments  in  the  command  line. 
When  the  arguments  have  been  interpreted,  we  know  how  to  set  variables  like  reps 
(repetitions),  bytes  (length  of  the  message  to  be  passed),  and  verbose  (to  control 
how  much  data  is  spewed  out).  Once  these  values  are  known,  the  host  instructs  each 
node  to  either  RECEIVE  or  SEND.  A  special  Tasking  packet  (structure)  carries 
instructions  to  each  node  independently.  Only  one  node  is  designated  to  SEND 
at  any  one  time;  the  rest  RECEIVE.  Receivers  simply  crecv()  the  given  number 
of  bytes  and  return  the  message  to  the  originator  by  calling  csend().  Since  this 
involves  a  round-trip,  the  issue  of  timing  requires  attention. 

We can divide the time measurement by two (to account for the round-trip),
provided we aren't deceived by the outcome. That is, passing two $b$-byte messages is
not the same as passing a single message of length $2b$ bytes. To make the timing data
credible, however, the round-trip method is essential. The precision of the mclock()
function is an additional issue. At best, mclock() is accurate to the millisecond (and
ten milliseconds may be a more reasonable expectation). Very short messages can
produce questionable results in terms of the precision of the timing data.

For this reason, tests of short messages should be repeated a number of times
within the block surrounded by time checks. This, of course, revives the same issue
(multiple repetitions of a message are not equivalent to a single, longer message).
We may proceed, however, provided we establish a common understanding of the
problem domain and terminology. I have used the term effective time to capture this
subtlety. Wherever this term appears, it should be interpreted according to the following
definition:

\[
t_e = \frac{t}{2p}
\]

where $t_e$ is the effective time, $t$ is the actual time measurement for the message, and $p$
is the number of repetitions. The factor of two is included to account for the
round-trip. For instance, suppose that the user asks for three repetitions of a message. The
implementation carries this out in a for loop. Time is sampled before and after the
loop. The inside of the loop is the simple csend() and crecv() sequence described
earlier. The effective time in this example would be $t_e = t/6$.
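
The definition is easy to state in code. The sketch below is illustrative only: the function names are hypothetical, and the conversion uses 1 kbyte = 1,024 bytes, an assumption chosen because it reproduces the rates reported in the tables of this appendix.

```c
/* Effective one-way time t_e = t / (2 p), in the same units as t.
 * t    : measured time for the whole timed loop
 * reps : number of round-trip repetitions p inside the loop */
double effective_time(double t, unsigned reps)
{
    return t / (2.0 * (double)reps);
}

/* Transmission rate in kbytes/sec for a message of `bytes` bytes whose
 * effective time is te_msec milliseconds.
 * Assumption: 1 kbyte = 1024 bytes. */
double rate_kbytes_per_sec(double bytes, double te_msec)
{
    return (bytes / 1024.0) / (te_msec / 1000.0);
}
```

For the 1,024-byte node-to-node entry with ten repetitions (t = 20.40 msec), this gives an effective time of 1.02 msec and a rate of about 980.39 kbytes/sec.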

In summary, there is no convenient (and credible) method for timing one-way
communications. If we time one-way communications, the results could be
misleading in that we could not be certain that the clock was started just before the
beginning of the csend() and stopped immediately after the receiving node
accumulated the final byte of the message. We must also consider the issue of blocking
communication.^1 The round-trip method is not so easily misled by the fact
that csend() is not actually blocking. The transmission duties are quickly handed
over to a communication manager and processing continues directly. The crecv()
enforces blocking communications and execution stops at this function until the last
byte has been acquired. Thus the round-trip method seems to be quite reliable,
particularly in the case of node-to-node communications (if the host is involved, the
results are less consistent).

^1 By definition, blocking means that the invoking process (send or receive) causes execution of
the program to stop (be blocked from the CPU) until the communications requirement has been
satisfied.

Since receiver nodes have nothing else to do but receive and retransmit the
message, the performance loss due to the round-trip method should be (almost
entirely) accounted for by two factors (loosely) placed into "software" and "hardware"
categories:

•  Software overheads like establishing and freeing the activation stack for functions
(e.g., the csend() and crecv() functions).

•  Hardware overheads associated with establishing the communication path and
performing switching. The take-down time for this task is probably negligible.

Hence,  if  this  method  of  analyzing  communications  performance  errs,  it  does  so  on 
the  conservative  side.  That  is,  the  timing  used  in  this  method  is  liberal  (if  anything), 
so  that  communication  rates  will  be  estimated  conservatively. 

C.   RESULTS 

Considering the nature of the implementation, communications will be
considered bidirectional. In particular, the term "host-to-node" communications does not
imply that the host is the originator of directed communication, but that a
bidirectional exchange takes place between some node and the host. The host does send
directed, one-way instructions to the nodes, but all timed communication originates
at a node and returns to that node (even if it goes to the host). There are essentially
three groups of results, each of which captures data for node-to-node
communications and host-to-node communications.

1.    Small  Messages  Repeated  Ten  Times 

The first test involved messages of length $\ell \le 1{,}024$ bytes. Since the
shortest of these would not generate trustworthy timing data, the repetition count,
$p$, was set at ten. This gave $t_e = t/20$. Table E.1 shows the results.

TABLE E.1: SHORT MESSAGES WITH TEN REPETITIONS


Message          Node-to-Node                    Host-to-Node
 Length         t      t_e         Rate          t      t_e         Rate
(Bytes)    (msec)   (msec) (kbytes/sec)     (msec)   (msec) (kbytes/sec)
------------------------------------------------------------------------
      1      7.10     0.36         2.75      71.40     3.57         0.27
      2      7.00     0.35         5.58      79.40     3.97         0.49
      4      7.00     0.35        11.16      78.90     3.95         0.99
      8      7.00     0.35        22.32      75.80     3.79         2.06
     16      7.20     0.36        43.40      78.10     3.91         4.00
     32      7.30     0.37        85.62      79.40     3.97         7.87
     64      7.70     0.39       162.34      87.10     4.36        14.35
    128     13.90     0.70       179.86     132.10     6.61        18.93
    192     14.30     0.72       262.24     134.60     6.73        27.86
    256     14.70     0.71       340.14     137.50     6.88        36.36
    320     15.30     0.77       408.50     139.60     6.98        44.77
    384     15.80     0.79       474.68     142.40     7.12        52.67
    448     16.20     0.81       540.12     147.40     7.37        59.36
    512     16.70     0.84       598.80     180.30     9.02        55.46
    576     17.10     0.86       657.89     201.50    10.08        55.83
    640     17.60     0.88       710.23     207.00    10.35        60.39
    704     18.10     0.91       759.67     208.80    10.44        65.85
    768     18.50     0.93       810.81     204.50    10.23        73.35
    832     19.00     0.95       855.26     180.00     9.00        90.28
    896     19.40     0.97       902.06     152.30     7.62       114.90
    960     19.90     0.99       942.21     147.80     7.39       126.86
   1024     20.40     1.02       980.39     148.90     7.45       134.32

Figure E.1: Speed of Small Host-Node Messages (Ten Repetitions)

a.    Host-to-Node  Performance 

The communication rates for small host-node messages with a
repetition count of ten are illustrated in Figure E.1. Communications involving the host
produce very irregular results (in the sense that the relationship between length and
performance is not straightforward). The experiment was executed when only one
user was logged in at the host and the results followed the same general pattern on
repeated tests.

Figure E.2: Speed of Small Messages Between Nodes (Ten Repetitions)

b.    Node-to-Node Performance

In  the  absence  of  contention  for  the  communication  medium,  node- 
to-node  communications  within  the  cube  are  quite  predictable.  Figure  E.2  shows 
transmission  rates  for  small  messages  (up  to  one  kilobyte)  repeated  ten  times. 




TABLE  E.2:  SHORT  MESSAGES  WITH  ONE  HUNDRED  REPETITIONS 


Message          Node-to-Node                    Host-to-Node
 Length         t      t_e         Rate          t      t_e         Rate
(Bytes)    (msec)   (msec) (kbytes/sec)     (msec)   (msec) (kbytes/sec)
------------------------------------------------------------------------
      1     68.60     0.34         2.85     837.40     4.19         0.23
      2     68.60     0.34         5.69     818.30     4.09         0.48
      4     68.70     0.34        11.37     795.00     3.98         0.98
      8     69.40     0.35        22.51     774.50     3.87         2.02
     16     70.30     0.35        44.45     758.30     3.79         4.12
     32     71.70     0.36        87.17     737.10     3.69         8.48
     64     75.30     0.38       166.00     721.30     3.61        17.33
    128    137.60     0.69       181.69    1020.10     5.10        24.51
    192    142.30     0.71       263.53    1007.10     5.04        37.24
    256    146.80     0.73       340.60    1007.00     5.04        49.65
    320    152.00     0.76       411.18    1004.50     5.02        62.22
    384    156.20     0.78       480.15    1013.40     5.07        74.01
    448    161.00     0.81       543.48    1043.80     5.22        83.83
    512    165.30     0.83       604.96    1152.90     5.76        86.74
    576    169.80     0.85       662.54    1335.40     6.68        84.24
    640    174.50     0.87       716.33    1419.50     7.10        88.06
    704    179.30     0.90       766.87    1688.50     8.44        81.43
    768    183.20     0.92       818.78    1869.90     9.35        80.22
    832    188.20     0.94       863.44    1520.00     7.60       106.91
    896    192.90     0.96       907.21    1070.30     5.35       163.51
    960    197.70     0.99       948.41    1061.60     5.31       176.62
   1024    202.40     1.01       988.14    1048.80     5.24       190.69

2.    Small  Messages  Repeated  One  Hundred  Times 

For the next experiment, data was collected from runs using the same
message lengths, but the repetition count, $p$, was raised to one hundred. This gives
$t_e = t/200$, as shown in Table E.2.

a.    Host-to-Node  Performance 

Figure E.3 gives the transmission rates corresponding to this data.

Figure E.3: Speed of Small Host-Node Messages (One Hundred Repetitions)

Figure E.4: Speed of Small Messages Between Nodes (One Hundred Repetitions)

b.    Node-to-Node  Performance 

Figure  E.4  shows  the  transmission  rates  for  the  node-to-node  messages. 
This  data  may  have  important  implications.  Consider  the  transmission  of  a  matrix 
row-by-row  within  a  loop  (where  one  row  is  transmitted  each  time  through  the 
loop).  The  expected  communications  performance  is  related  to  the  number  of  bytes 
in a single row of the matrix, not the size of the entire matrix.

3.    Larger Messages

The final test considered longer messages ($1{,}024 \le \ell \le 262{,}144$) that were
not repeated. This gives $t_e = t/2$. Since the experiment was performed over a rather
large set of message lengths, the data is divided at an arbitrary point. Messages
of 64K bytes and less are designated "medium" length messages and placed into
Table E.3. Messages of length 128K bytes and greater are designated "long" messages
and placed into Table E.4. There is no hidden significance to this separation; it just
made for tables of reasonable length.

The figures that follow are based upon the combined data of both of these
tables. The host terminates execution at the crecv() if we ask for more than 262,144
bytes in a single message. Chapter 2 (iPSC/2 C Library Calls) of [Ref. 45: pp. 2-16,
2-19] explains: "messages to or from a host process are limited to a maximum
of 256K bytes. There is no limit on message length between nodes." This explains
why the data stops at that message size.

TABLE E.3: MESSAGES OF MEDIUM LENGTH


Message          Node-to-Node                    Host-to-Node
 Length         t      t_e         Rate          t      t_e         Rate
(Bytes)    (msec)   (msec) (kbytes/sec)     (msec)   (msec) (kbytes/sec)
------------------------------------------------------------------------
   1024      2.20     1.10       909.09       9.00     4.50       222.22
   2048      2.80     1.40      1428.57      10.40     5.20       384.62
   3072      3.70     1.85      1621.62      11.90     5.95       504.20
   4096      4.40     2.20      1818.18      13.40     6.70       597.01
   5120      5.10     2.55      1960.78      14.50     7.25       689.66
   6144      5.80     2.90      2068.97      14.50     7.25       827.59
   7168      6.50     3.25      2153.85      15.50     7.75       903.23
   8192      7.40     3.70      2162.16      16.50     8.25       969.70
   9216      8.10     4.05      2222.22      19.50     9.75       923.08
  10240      8.80     4.40      2272.73      18.00     9.00      1111.11
  11264      9.50     4.75      2315.79      18.90     9.45      1164.02
  12288     10.30     5.15      2330.10      19.00     9.50      1263.16
  13312     10.90     5.45      2385.32      19.60     9.80      1326.53
  14336     11.80     5.90      2372.88      20.30    10.15      1379.31
  15360     12.50     6.25      2400.00      21.90    10.95      1369.86
  16384     13.20     6.60      2424.24      22.40    11.20      1428.57
  17408     13.90     6.95      2446.04      23.30    11.65      1459.23
  18432     14.60     7.30      2465.75      24.90    12.45      1445.78
  19456     15.40     7.70      2467.53      24.30    12.15      1563.79
  20480     16.10     8.05      2484.47      27.30    13.65      1465.20
  21504     16.80     8.40      2500.00      27.10    13.55      1549.82
  22528     17.60     8.80      2500.00      27.00    13.50      1629.63
  23552     18.40     9.20      2500.00      27.80    13.90      1654.68
  24576     19.10     9.55      2513.09      29.30    14.65      1638.23
  25600     19.80     9.90      2525.25      29.40    14.70      1700.68
  26624     20.50    10.25      2536.59      30.60    15.30      1699.35
  27648     21.30    10.65      2535.21      30.90    15.45      1747.57
  28672     22.10    11.05      2533.94      33.50    16.75      1671.64
  29696     22.70    11.35      2555.07      38.50    19.25      1506.49
  30720     23.50    11.75      2553.19      37.90    18.95      1583.11
  31744     24.20    12.10      2561.98      37.90    18.95      1635.88
  32768     24.90    12.45      2570.28      38.10    19.05      1679.79
  65536     48.50    24.25      2639.18      59.90    29.95      2136.89

TABLE E.4: LONG MESSAGES


Message          Node-to-Node                    Host-to-Node
 Length         t      t_e         Rate          t      t_e         Rate
(Bytes)    (msec)   (msec) (kbytes/sec)     (msec)   (msec) (kbytes/sec)
------------------------------------------------------------------------
 131072     95.60    47.80      2677.82     109.40    54.70      2340.04
 150528    109.60    54.80      2682.48     123.60    61.80      2378.64
 161792    117.70    58.85      2684.79     131.60    65.80      2401.22
 162816    118.40    59.20      2685.81     132.90    66.45      2392.78
 163840    119.10    59.55      2686.82     133.60    66.80      2395.21
 164864    119.90    59.95      2685.57     135.00    67.50      2385.19
 165888    120.60    60.30      2686.57     136.30    68.15      2377.11
 172032    125.00    62.50      2688.00     140.80    70.40      2386.36
 182272    132.40    66.20      2688.82     148.10    74.05      2403.78
 192512    139.70    69.85      2691.48     155.60    77.80      2416.45
 202752    147.10    73.55      2692.05     164.60    82.30      2405.83
 223232    161.80    80.90      2694.68     181.10    90.55      2407.51
 243712    176.50    88.25      2696.88     194.80    97.40      2443.53
 253952    183.80    91.90      2698.59     202.80   101.40      2445.76
 259072    187.60    93.80      2697.23     205.50   102.75      2462.29
 262144    189.70    94.85      2699.00     210.50   105.25      2432.30

Figure E.5: Speed of Large Host-Node Messages

a.    Host-to-Node Performance

The host-to-node communication rates (for large messages) are
illustrated in Figure E.5.

Figure E.6: Speed of Large Messages Between Nodes

b.    Node-to-Node  Performance 

Figure E.6 shows the transmission rates for the same long messages
when passed among nodes of the hypercube. To move the plot of Figure E.6 out
into the open, a plot of transmission rate versus $\log_{10} \ell$ is shown in Figure E.7.

Figure E.7: Node-to-Node Transmission Rates for Large Messages

D.   CONCLUSIONS

One of the obstacles that this experiment carefully avoided was competition
for the links. Contention for communications resources may be inherent in certain
parallel programs. Potential causes and effects of contention should always be given
due consideration in the crafting of a parallel application. All of the algorithms that
were tested in this research work involved very structured, regular communications
schemes. An application with very random communication patterns should be
expected to behave very differently. Additionally, the communication scheme for every
program in this work was designed to use the shortest possible path.

The circuit switching approach has the disadvantage that a single message must
control the entire path from origin to destination. Under a less controlled, random
pattern of communications, the performance of the communications subsystem might
reasonably be expected to degrade. Other portions of this
thesis show that a communication-bound algorithm can experience severe performance
degradation as well. There is no specific claim that the results obtained in this
experiment represent an upper bound for node-to-node communications within the
hypercube, but they are probably good estimates for an upper bound.

Host-node communication is slower than node-to-node communication. This
is not surprising (consider the physical distances and materials). In the absence of
competition for the links, node-to-node transmission rates are essentially predictable
for a given message length. There is a tremendous rise in transmission rate as message
length goes from one byte to the vicinity of twenty kilobytes. Thereafter, smaller
(apparently asymptotic) performance gains are achieved by increasing the message
size. A similar phenomenon occurs with host-node communications, but it takes
much longer messages to break, say, the two-megabytes-per-second transmission
rate.

These performance measures are quite appealing for long messages, but
consider transmissions of shorter (and possibly repetitious) messages. The data shows
that short messages are penalized, even if they are part of a loop that involves a
good deal of communication. Each instance of csend() or crecv() is distinct and
incurs its own start-up cost. This is an important note for anyone considering
transmission of the rows (or columns) of a matrix within a loop structure. The
potential of (pre-transmission) storage of matrices (two-dimensional arrays) into
one-dimensional arrays might be investigated as a means of increasing the
communications rate (provided the cost of copying the array is not prohibitive).
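
The packing idea can be sketched in a few lines. This is only an illustration under assumptions: the function name `pack_rows` is hypothetical, the matrix is assumed to be stored as an array of row pointers, and the packed buffer would then be handed to a single send call instead of one call per row.

```c
#include <stddef.h>
#include <string.h>

/* Copy a rows x cols matrix, stored as an array of row pointers, into one
 * contiguous buffer so that it can be transmitted as a single message.
 * buf must have room for rows * cols doubles.
 * Returns the number of bytes packed. */
size_t pack_rows(double **matrix, size_t rows, size_t cols, double *buf)
{
    for (size_t r = 0; r < rows; ++r) {
        memcpy(buf + r * cols, matrix[r], cols * sizeof(double));
    }
    return rows * cols * sizeof(double);
}
```

For a sense of scale, using the node-to-node measurements in this appendix: 64 rows of 1,024 bytes sent one per loop iteration would cost roughly 64 x 1.01 msec, or about 64.6 msec of effective time (Table E.2), whereas a single 65,536-byte message took 24.25 msec (Table E.3). Whether the saving survives the cost of the copy is exactly the open question raised above.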

Communications  in  a  transputer  network  was  not  developed  in  this  work,  but 
Bryant  [Ref.  26]  gives  a  very  thorough  analysis  of  communications  and  calculations 
in  a  network  of  transputers.  On  pages  31-34,  Bryant  gives  a  good  summary  of 
unidirectional  and  bidirectional  data  transfer  rates.  He  discusses  link  interaction  (i.e., 
how  communications  performance  varies  as  one,  two,  or  all  four  of  the  transputer's 
links  are  engaged  in  communication)  on  pages  34-38  and  concludes  that  the  effects 
of  link  interaction  are  minimal. 

Bryant also discusses the effects of varied communication loads on processor
performance. On pages 38-44, he finds that bombarding a transputer with many
small messages while it is trying to perform calculations can severely degrade the
processor's performance. His Figures 3.8 and 3.9 show that, with only one link
active, messages of size 100 bytes and larger cause negligible performance
degradation. With all four links active, messages of size greater than one kilobyte should be
used to free the processor from most of the communications overhead.

Pages 36 and 37 of Bryant's thesis show the effects of message length on the
communication rate. Bryant's Figures 3.4 and 3.5 are quite similar to Figure E.6
above, but the transputers are much more responsive (i.e., there seems to be less
overhead involved, so the peak communications rate is achieved much earlier). In
fact, the transputers are near their peak transmission rate with messages of 100 bytes,
and messages of one kilobyte and greater always travel at peak rates.

Comparing a transputer system to an iPSC/2 system, in terms of
communications performance, is essentially a lesson in the differences between
store-and-forward switching and circuit switching for multi-hop communications. Bryant
shows [Ref. 26: pp. 83-85] that the store-and-forward transmission rates suffer as
the number of hops grows. The direct-connect (circuit switching) approach recovers
its overhead on multi-hop communications, but it ties up the entire path to do so
(making it unavailable to other potential users). The key difference is that
communications performance with the direct-connect method is very nearly independent of
the number of hops.

The transputer system seems to enforce true blocking communications on both
the sending and receiving ends (byte-by-byte acknowledgment is part of the
protocol). The iPSC/2 csend() is not blocking, but the crecv() function is blocking.
Proper handling of these issues can become important when implementing an
algorithm. Each method has advantages and disadvantages, but (at least for the current
systems) transputers seem better suited for applications involving short messages
over short distances and the iPSC/2 seems to handle long messages over long distances
better.

E.   SOURCE  CODE  LISTINGS 

The  source  code  listings  for  the  programs  used  for  these  tests  are  supplied  on 
the  pages  that  follow.  The  makefile  commtst.mak  appears  first  and  describes  the 
dependencies  among  the  files  and  compilation  procedures.  Next,  commtst.h  is  the 
header  file  associated  with  these  programs.  Finally,  the  actual  code  is  given  in  a  host 
program  called  commtst.c  and  the  node  program  commtstn.c. 
commtst.mak

# Author:   Jonathan E. Hartman, U. S. Naval Postgraduate School
# Purpose:  Makefile for Hypercube Communications Test Programs
# Date:     07 August 1991

all:	hostcode nodecode

help:
	chelp

#
hostcode:	commtst.o clargs.o
	cc clargs.o commtst.o -host -o commtst

clargs.o:	clargs.h clargs.c
commtst.o:	commtst.h commtst.c

#
nodecode:	commtstn.o
	cc commtstn.o -node -o commtstn

commtstn.o:	commtstn.c commtst.h

# Execute it!
run:	all
	commtst -d 3 -b 1024 -r 2

# Delete object files, executables
clean:
	rm *.o
	rm commtst
	rm commtstn

# EOF commtst.mak
commtst.h

/* ===================== PROGRAM INFORMATION =====================
 *
 *  SOURCE   :  commtst.h
 *  VERSION  :  1.2
 *  DATE     :  07 August 1991
 *  AUTHOR   :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ========================= DESCRIPTION =========================
 *
 *  This header file gives common information for use across the host program
 *  commtst.c and the node program commtstn.c.  A more complete description
 *  can be found in commtst.c.
 */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE   -1
#endif

#define MAX_CUBESIZE   16
#define ROOT           -1

#define RECEIVE        /* [value illegible in source] */
#define SEND           /* [value illegible in source] */

#define FALSE          /* [value illegible in source] */
#define TRUE           /* [value illegible in source] */

/* ======================= TYPE DEFINITION =======================
 *
 *  The following structure is the framework that the root processor (host)
 *  uses to pass instructions to the worker nodes in the cube.
 */

typedef struct {

    int   task;                        /* choose RECEIVE or SEND as above    */
    long  bytes;                       /* length of message                  */
    long  reps;                        /* number of repetitions              */
    int   destination[MAX_CUBESIZE];   /* for senders: identifies addressees */

} Tasking;

/* ===========  EOF     commtst.h  =========== */
commtst.c

/* ===================== PROGRAM INFORMATION =====================
 *
 *  SOURCE      :  commtst.c
 *  VERSION     :  1.2
 *  DATE        :  07 August 1991
 *  AUTHOR      :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *  USAGE       :  commtst [-d dimension] [-b bytes] [-r repetitions] [-v]
 *
 *  EXAMPLE     :  If you type 'commtst -d 3 -v -b 1024 -r 10', it means to
 *                 run the program on a dimension 3 hypercube in the verbose
 *                 mode, with messages of length 1024 bytes, and 10
 *                 repetitions for each message.
 *
 *  REFERENCES  :  [1] iPSC/2 Programmer's Reference Manual
 *
 * ========================= DESCRIPTION =========================
 *
 *  This program runs on the host.  It orchestrates various point-to-point
 *  communication tasks between nodes of a hypercube.  The time of round-trip
 *  communications is gathered and printed out.  The output includes the time
 *  required and rate of communication (taking into account repetitions and
 *  round-trips).  The 'verbose' mode gives a more detailed node-by-node
 *  accounting of the run.
 */

char *version = "Hypercube Communications Test, Version 1.2";

/* ========================== ALGORITHM ==========================
 *
 *  The root (host) processor determines who will communicate with whom, and
 *  when.  No node operates independently.  The host identifies a sender and
 *  receiver(s).  The host also gives the length of the message that should
 *  be passed and the number of times that the message is to be repeated
 *  (multiple repetitions may be required when the message is short since
 *  mclock() returns milliseconds).  The 'Tasking' structure holds instructions
 *  from the manager (i.e., SEND or RECEIVE, the length of the message, number
 *  of repetitions, and addressees).  When this structure is received at
 *  a node, it performs the task and awaits further instructions from the
 *  manager processor.  If the processor is a sender, it returns timing data
 *  to the host upon completion.
 */
#include <stdio.h>
#include "commtst.h"
#include "ipsc.h"
#include "macros.h"
#include "clargs.h"

#define ASCII_CONVERSION  48   /* for char -> int conversion of 0..3 */
#define CT_SIZE           4    /* for cubetype[] size                */

#define NUM_ARGS               /* -d -b -r -v  [value illegible in source] */
#define DIM                    /* index values into optv[]  [values illegible in source] */
#define BYTES
#define REPS
#define VERBOSE

/* ===================== FUNCTION DEFINITION ===================== */

#ifdef PROTOTYPE

void init(int argc, char **argv, char cubetype[CT_SIZE],
          int *dim, long *bytes, long *reps, int *verbose)

#else

void init(argc, argv, cubetype, dim, bytes, reps, verbose)

int    argc;
char **argv,
       cubetype[CT_SIZE];
int   *dim;
long  *bytes,
      *reps;
int   *verbose;

#endif
{
    int  count = 1,
         valid = FALSE;

    Opt_Struct  *optv[NUM_ARGS];

    /*
     *  The first step is to make a table of all of the valid arguments.  The
     *  structure is defined more carefully in clargs.h, but the basic idea is
     *  that we have an array of pointers to type Opt_Struct (option structure)
     *  ...in this case, there are NUM_ARGS valid arguments and the next few
     *  steps take care of allocation and definition of them.  When this is
     *  done, it is time to call interpret_args() to see what the user entered.
     */
    optv[DIM]            = (Opt_Struct *) calloc(1, sizeof(Opt_Struct));
    optv[BYTES]          = (Opt_Struct *) calloc(1, sizeof(Opt_Struct));
    optv[REPS]           = (Opt_Struct *) calloc(1, sizeof(Opt_Struct));
    optv[VERBOSE]        = (Opt_Struct *) calloc(1, sizeof(Opt_Struct));
    optv[DIM]->lanswer   = (long *) calloc(1, sizeof(long));
    optv[BYTES]->lanswer = (long *) calloc(1, sizeof(long));
    optv[REPS]->lanswer  = (long *) calloc(1, sizeof(long));

    /*  The intel compiler didn't like ->argname = "-d"; etc.  */
    optv[DIM]->argname[0]     = '-';
    optv[DIM]->argname[1]     = 'd';
    optv[DIM]->subargc        = 1;
    optv[DIM]->subargi        = NEXT_LONG;

    optv[BYTES]->argname[0]   = '-';
    optv[BYTES]->argname[1]   = 'b';
    optv[BYTES]->subargc      = 1;
    optv[BYTES]->subargi      = NEXT_LONG;

    optv[REPS]->argname[0]    = '-';
    optv[REPS]->argname[1]    = 'r';
    optv[REPS]->subargc       = 1;
    optv[REPS]->subargi       = NEXT_LONG;

    optv[VERBOSE]->argname[0] = '-';
    optv[VERBOSE]->argname[1] = 'v';
    optv[VERBOSE]->subargc    = 0;

    *dim = -1;

    interpret_args(argc, argv, NUM_ARGS, optv);

    if (optv[DIM]->found)  *dim = (int) optv[DIM]->lanswer[0];

    switch (*dim) {

        case 0 :  case 1 :  case 2 :  case 3 :
            break;

        default:
            while (!valid) {
                printf("Enter desired cube dimension (in {0, 1, 2, 3}): ");
                scanf("%d", dim);
                fflush(stdin);
                switch (*dim) {
                    case 0 : case 1 : case 2 : case 3 : valid = TRUE; break;
                }
            }
    }  /* end switch() */
    if (optv[BYTES]->found)  *bytes = optv[BYTES]->lanswer[0];

    valid = FALSE;

    if (*bytes < 1) {
        while (!valid) {
            printf("Enter message length (bytes): ");
            scanf("%ld", bytes);
            fflush(stdin);
            if (*bytes > 0) { valid = TRUE; }
            else { printf("Message length must be positive.\n"); }
        }
    }

    if (optv[REPS]->found) { *reps = optv[REPS]->lanswer[0]; }
    else {

        printf("Non-existing (or invalid) repetition count, ");
        printf("using one repetition.\n\n");
        *reps = 1;
    }

    /* One assignment from the conditional; the form "c ? x = a : x = b"
     * is not parsed as two separate assignments in C.                   */
    *verbose = (optv[VERBOSE]->found) ? TRUE : FALSE;

    cubetype[0] = 'd';             /* for dimension (to follow)     */
    cubetype[1] = (char)(*dim + ASCII_CONVERSION);
    cubetype[2] = 'f';             /* means nodes are 386/387 combo */
    cubetype[3] = 0;

    printf("Initialization complete...Cube Dimension:  %d\n", *dim);
    printf("  Message Length:  %ld\n", *bytes);
    printf("  Repetitions:     %ld\n\n", *reps);
    if (*verbose) printf("  Verbose Mode:    ON");
}
/*  End init()  */


#ifdef PROTOTYPE

main(int argc, char *argv[])

#else

main(argc, argv)

int   argc;
char *argv[];

#endif
{   /* begin main() */

    char   *cubename = "Hypercube",
            cubetype[CT_SIZE],
           *msg,
           *nodecode = "commtstn";

    float   avg,
            avg_hostrate,
            avg_hosttime = 0.0,    /* accumulated with += later, so start at zero */
            avg_rate,
            avg_time     = 0.0,    /* accumulated with += later, so start at zero */
            bytes,
            reps;

    int     cubesize,
            dim,
            i,
            j,
            verbose;

    unsigned long **timing_data;

    Tasking task_packet;


    printf("\n%s\n\n", version);

    init(argc, argv, cubetype, &dim, &(task_packet.bytes),
         &(task_packet.reps), &verbose);

    bytes  = (float) task_packet.bytes;
    reps   = (float) task_packet.reps;
    bytes *= (2.0 * reps);     /*  account for two-way communications, reps  */

    cubesize = POW2(dim);

    timing_data = (unsigned long **) calloc(cubesize, sizeof(unsigned long *));

    for (i = 0; i < cubesize; i++) {

        timing_data[i] = (unsigned long *) calloc(cubesize, sizeof(unsigned long));
    }

    if (!(msg = (char *) calloc(task_packet.bytes, sizeof(char)))) {

        printf("main(): Allocation failure for msg.\n");
        exit(EXIT_FAILURE);
    }
    /*  Get the cube and load the node code  */

    getcube(cubename, cubetype, NULL, 0);
    attachcube(cubename);
    setpid(0);
    load(nodecode, ALL_NODES, NODE_PID);


    /*  Perform the tasking, receive the message, return it, receive and print
     *  timing data...repeat for all players.  The outer loop index, i, will
     *  represent the sender node.  The j index runs the other (RECEIVE)
     *  players.
     */

    for (i = 0; i < cubesize; i++) {

        /*  Get the receivers ready first  */
        task_packet.task = RECEIVE;
        task_packet.destination[0] = i;
        task_packet.destination[1] = cubesize;   /* impossible flags end */

        for (j = 0; j < i; j++) {

            csend(0, &task_packet, sizeof(Tasking), j, NODE_PID);
        }

        for (j = (i+1); j < cubesize; j++) {

            csend(0, &task_packet, sizeof(Tasking), j, NODE_PID);
        }

        /*  Then prepare the sender ==> he can start  */
        task_packet.task = SEND;
        for (j = 0; j < i; j++)  task_packet.destination[j] = j;
        task_packet.destination[i] = ROOT;
        for (j = (i+1); j < cubesize; j++)  task_packet.destination[j] = j;

        csend(0, &task_packet, sizeof(Tasking), i, NODE_PID);

        /*  Receive from the sender and return his message  */
        for (j = 0; j < task_packet.reps; j++) {

            crecv(ANY_TYPE, msg, task_packet.bytes);
            csend(0, msg, task_packet.bytes, i, NODE_PID);
        }

        /*  Receive the timing data from this run and print it  */
        crecv(ANY_TYPE, timing_data[i], (cubesize * sizeof(unsigned long)));

    }   /* end for (i) */
    for (i = 0; i < cubesize; i++) {

        if (verbose) {

            printf("Source   Dest.    Time (msec)    Rate (kilobytes/second)\n");
            printf("======   =====    ===========    =======================\n");
            printf("%4d     HOST     %10lu  ", i, timing_data[i][i]);
            printf("   %10.2f\n", (bytes / ((float) timing_data[i][i])));
        }

        avg = 0.0;

        for (j = 0; j < cubesize; j++) {

            if (i != j) {

                avg += (float) timing_data[i][j];

                if (verbose) {

                    printf("         %4d", j);
                    printf("     %10lu  ", timing_data[i][j]);
                    printf("   %10.2f\n", (bytes / ((float) timing_data[i][j])));
                }
            }

            if (j == (cubesize - 1)) {

                avg /= (float) cubesize - 1;

                if (verbose) {

                    printf("============================================");
                    printf("==========\n");
                    printf("Averages          %9.1f msec  ", avg);
                    printf("   %7.2f", bytes/avg);
                    printf(" kbytes/sec\n\n\n");
                }
            }
        }   /* end for(j) */
    }   /* end for(i) */

    for (i = 0; i < cubesize; i++) {

        for (j = 0; j < cubesize; j++) {

            /* The diagonal holds the node <--> host time; written as if/else
             * because "c ? a += x : b += y" does not parse as intended in C. */
            if (i == j)  avg_hosttime += timing_data[i][j];
            else         avg_time     += timing_data[i][j];
        }
    }
    avg_hosttime /= cubesize;
    avg_hostrate  = bytes/avg_hosttime;

    avg_time     /= ((cubesize - 1) * cubesize);
    avg_rate      = bytes/avg_time;

    printf("If we average all of the times and rates....\n\n");

    printf("    Average Time:  %9.1f milliseconds\n", avg_time);

    printf("    Average Rate:  %10.2f kilobytes/second\n\n\n", avg_rate);

    printf("NOTE:  Average and Rate values are for the nodes ONLY.\n");
    printf("       They do not include the host timing data.\n\n\n");

    printf("The averages for the node <--> host communications were:\n\n");
    printf("    Average Time:  %9.1f milliseconds\n", avg_hosttime);
    printf("    Average Rate:  %10.2f kilobytes/second\n\n\n", avg_hostrate);
}

/* ===========  EOF     commtst.c  =========== */
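The rate arithmetic used by commtst.c can be isolated as follows. This is a minimal sketch rather than part of the original listing: the function names are hypothetical, a kilobyte is taken as 1000 bytes (so bytes per millisecond equals kilobytes per second), and the guard against a zero elapsed time is an addition that the listing above does not contain.

```c
#include <assert.h>

/* Bytes moved during one timed run: every repetition is a round trip,
 * so the message crosses the link twice per repetition. */
double bytes_moved(long msg_bytes, long reps)
{
    assert(msg_bytes > 0 && reps > 0);
    return 2.0 * (double) msg_bytes * (double) reps;
}

/* Rate in kilobytes per second (1 kilobyte = 1000 bytes), given the
 * elapsed time in milliseconds as reported by a millisecond clock.
 * Returns 0.0 when the run was too short to register on the clock;
 * this guard is an assumption of the sketch. */
double rate_kbytes_per_sec(long msg_bytes, long reps, unsigned long elapsed_ms)
{
    if (elapsed_ms == 0) {
        return 0.0;
    }
    return bytes_moved(msg_bytes, reps) / (double) elapsed_ms;
}
```

For instance, 10 round trips of a 1024-byte message in 40 milliseconds move 20480 bytes, which the sketch reports as 512 kilobytes per second. A zero elapsed time is the case that the repetition count is meant to avoid for short messages.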
commtstn.c

/* ===================== PROGRAM INFORMATION =====================
 *
 *  SOURCE   :  commtstn.c
 *  VERSION  :  1.2
 *  DATE     :  07 August 1991
 *  AUTHOR   :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ========================= DESCRIPTION =========================
 *
 *  This program is loaded by commtst.c (which runs on the host).  This code
 *  (commtstn.c) runs on the nodes of a hypercube created by the host program.
 *  For more information, see commtst.c.
 */

#include <stdio.h>
#include "commtst.h"
#include "ipsc.h"

#define SUCCESS 0


#ifdef PROTOTYPE

main(int argc, char *argv[])

#else

main(argc, argv)

int   argc;
char *argv[];

#endif
{
    char  *msg;

    int    cubesize = numnodes(),
           i,
           j,
           return_addr;

    long   rep;

    unsigned long start, *timing_data;

    Tasking task_packet;
    timing_data = (unsigned long *) calloc(cubesize, sizeof(unsigned long));

    for (i = 0; i < cubesize; i++) {

        crecv(ANY_TYPE, &task_packet, sizeof(Tasking));

        msg = (char *) calloc(task_packet.bytes, sizeof(char));

        switch (task_packet.task) {

            case RECEIVE :

                return_addr = task_packet.destination[0];

                for (rep = 0; rep < task_packet.reps; rep++) {

                    crecv(ANY_TYPE, msg, task_packet.bytes);
                    csend(0, msg, task_packet.bytes, return_addr, NODE_PID);
                }

                break;


            case SEND :

                j = 0;

                while ((j < cubesize) && (task_packet.destination[j] < cubesize)) {

                    start = mclock();

                    for (rep = 0; rep < task_packet.reps; rep++) {

                        (j == mynode()) ?
                            csend(0, msg, task_packet.bytes, myhost(), NODE_PID) :
                            csend(0, msg, task_packet.bytes, j, NODE_PID);

                        crecv(ANY_TYPE, msg, task_packet.bytes);
                    }

                    timing_data[j] = mclock() - start;

                    j++;
                }

                /*  Return the timing data  */
                csend(0, timing_data, (cubesize * sizeof(unsigned long)),
                      myhost(), NODE_PID);

                break;
            default :

                printf("Unrecognized task at node %ld.\n", mynode());
                exit(EXIT_FAILURE);

        }   /* end switch() */


        free(msg);


    }   /* end for() */

    return(SUCCESS);

}

/* ===========  EOF     commtstn.c  =========== */
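The addressee list that the host builds for each sender in commtst.c can be restated as a small helper. This is a sketch only: the function name fill_sender_destinations is hypothetical, and MAX_CUBESIZE and ROOT are repeated here with the values given in commtst.h so that the fragment compiles on its own.

```c
#include <assert.h>

#define MAX_CUBESIZE 16     /* as in commtst.h */
#define ROOT         (-1)   /* as in commtst.h: marks the host */

/* Fill the addressee list for the node `sender` in a cube of `cubesize`
 * nodes.  Slot j names node j, except the sender's own slot, which is
 * marked ROOT so that the round trip in that slot goes to the host. */
void fill_sender_destinations(int destination[MAX_CUBESIZE],
                              int sender, int cubesize)
{
    int j;

    assert(cubesize >= 1 && cubesize <= MAX_CUBESIZE);
    assert(sender >= 0 && sender < cubesize);

    for (j = 0; j < cubesize; j++) {
        destination[j] = (j == sender) ? ROOT : j;
    }
}
```

With this layout a sender performs cubesize timed exchanges: one with each of the other nodes and one with the host. That is why the host program reads the diagonal entries timing_data[i][i] as the node-to-host times.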
APPENDIX F
MATRIX LIBRARY

This appendix contains part of the matrix library, matlib, which is often used
and referenced in other sections and code. It could be argued that "matrix library"
is a misnomer since much of the code has little to do with matrices. This criticism is
true, but I will defend the name since the entire reason for creating such a library
was to handle matrices in a more reasonable way. The last section of this appendix
contains all of the source code for Gauss factorization with partial pivoting, and a
short excerpt from the complete pivoting code.

The specifications and a portion of the source code for the library are given on
the pages to follow. The original intent was to include the source code in its entirety,
but this would require more than double the current number of pages, so most of the
source has been omitted. The files are divided into three logical groups:

1.  Makefiles  that  simplify  maintenance  of  the  library,  show  dependencies  among 
the  files,  and  describe  the  compilation  procedures  that  are  used  to  generate  the 
loadable  (executable)  code. 

2.  Standard files (mostly C header files) that make definitions available (for
consistency) across a wide range of files. The range is implied by the content of
the file. These files include manifest constants that are installed using the C
Preprocessor #define directive, type definitions that are intended for use across
several files, and macro definitions that are expanded by the C Preprocessor.

3.  Source code files that appear in pairs, like filename.h and filename.c, or (mostly)
as a header file alone. The header file gives remarks, definitions of manifest
constants, type definitions, and function declarations (specifications) that pertain to
the associated source code (i.e., the code within filename.c). Again, the latter
has been omitted in most cases.

4.  The Gauss Factorization Code. All of the source code for the partial pivoting
version is given, and an excerpt of the pivot selection function from the complete
pivoting code is also provided.

A.  MAKEFILES 

logc.mak  This makefile is a standard template for programs compiled with the
Logical Systems C (version 89.1) product.

matlib.mak  This makefile is used to translate matlib into a usable form. With
Logical Systems C, it creates a library suitable for installation and use as any
other normal C library. The portion of the makefile used on the Intel iPSC/2
simply works in the current directory to translate the source into object code so
that other programs can reference it.


logc.mak

#
#
#   AUTHOR   :  Jonathan E. Hartman, U. S. Naval Postgraduate School
#   PURPOSE  :  Makefile for Hypercube Communications Test Programs (LogC)
#   DATE     :  10 August 1991
#
#

ROOTCODE=filename
NODECODE=filename
NIF_FILE=filename


# ===== OPTIONS AND DEFINITIONS =====
#
#   The following section establishes various options and definitions.  We
#   start with PP, the Logical Systems C Preprocessor.  The '-dX' option
#   (with no macro_expression) is like '#define X 1'.  Next we set up the
#   compilation options for Logical Systems' TCX Transputer C Compiler.  The
#   '-c' means compress the output file.  The options beginning with '-p'
#   tell TCX to generate code for the appropriate processor:
#
#       -p2         T212 or T222
#       -p25        T225
#       -p4         T414
#       -p45        T400 or T425
#       -p8         T800
#       -p85        T801 or T805
#
#   Logical Systems' TASM Transputer Assembler is next.  The '-c' means
#   compress the output file (it can cut it in half)!  The '-t' is used
#   because the input to TASM will be from a language translator (TCX's
#   output) and not from assembly source code.
#
#   The final list tells TLNK which libraries to look at during linking.
#   It also establishes an entry point.  You should always use _main for
#   the root node; otherwise use _ns_main (for other nodes).

PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800
TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8
TASMOPT=-ct
T2LIB=t2lib.tll
T4LIB=matlib4.tll t4cube.tll t4lib.tll
T8LIB=matlib8.tll t8cube.tll t8lib.tll
RENTRY=_main
NENTRY=_ns_main

# ===== DEFAULT ===> MAKE ALL =====

all:	$(ROOTCODE).tld $(NODECODE).tld


# ===== ROOT CODE =====

$(ROOTCODE):	$(ROOTCODE).tld

$(ROOTCODE).tld:	$(ROOTCODE).trl
	echo FLAG c                  >  $(ROOTCODE).lnk
	echo LIST $(ROOTCODE).map    >> $(ROOTCODE).lnk
	echo INPUT $(ROOTCODE).trl   >> $(ROOTCODE).lnk
	echo ENTRY $(RENTRY)         >> $(ROOTCODE).lnk
	echo LIBRARY $(T4LIB)        >> $(ROOTCODE).lnk
	tlnk $(ROOTCODE).lnk

$(ROOTCODE).trl:	$(ROOTCODE).c
	pp   $(ROOTCODE).c   $(PPOPT4)
	tcx  $(ROOTCODE).pp  $(TCXOPT4)
	tasm $(ROOTCODE).tal $(TASMOPT)


# ===== NODE CODE =====

$(NODECODE):	$(NODECODE).tld

$(NODECODE).tld:	$(NODECODE).trl
	echo FLAG c                  >  $(NODECODE).lnk
	echo LIST $(NODECODE).map    >> $(NODECODE).lnk
	echo INPUT $(NODECODE).trl   >> $(NODECODE).lnk
	echo ENTRY $(NENTRY)         >> $(NODECODE).lnk
	echo LIBRARY $(T8LIB)        >> $(NODECODE).lnk
	tlnk $(NODECODE).lnk

$(NODECODE).trl:	$(NODECODE).c
	pp   $(NODECODE).c   $(PPOPT8)
	tcx  $(NODECODE).pp  $(TCXOPT8)
	tasm $(NODECODE).tal $(TASMOPT)

# ===== EXECUTION =====
#

run:	$(ROOTCODE).tld $(NODECODE).tld $(NIF_FILE).nif
	ld-net $(NIF_FILE)


# ===== CLEAN UP =====
#

clean:
	del $(ROOTCODE).lnk
	del $(NODECODE).lnk
	del $(ROOTCODE).map
	del $(NODECODE).map
	del $(ROOTCODE).tal
	del $(NODECODE).tal
	del $(ROOTCODE).pp
	del $(NODECODE).pp
	del $(ROOTCODE).trl
	del $(NODECODE).trl


# EOF logc.mak
matlib.mak

# ================== MAKEFILE FOR MATRIX LIBRARY ==================
#
#
#   SOURCE   :  matlib.mak
#   DATE     :  17 August 1991
#   AUTHOR   :  Jonathan E. Hartman, U. S. Naval Postgraduate School
#   PURPOSE  :  Make the matrix library 'matlib'.
#
#   REMARKS  :  This makefile works with Logical Systems C, version 89.1,
#               and the Intel iPSC/2 compiler.  The LogC portions of this
#   makefile actually construct libraries of the functions available in the
#   source files indicated.  There are two libraries generated (matlib4.tll
#   & matlib8.tll) since the code is compiled for T414 or T800 processors.
#   For the Intel compiler, I have not created a library; but have used the
#   object code as needed.  There are a few sections that pertain to both
#   compilers.  The sections that only pertain to a particular compiler are
#   clearly marked 'Intel iPSC/2' or 'Logical Systems C'.
#
# =================== 1.) DEFINITIONS AND OPTIONS ===================
#
#   The following options and definitions are required.  A more thorough
#   explanation can be found in 'logc.mak' or in the Logical Systems C
#   Transputer Toolset manual.

THISMAKEFILE=matlib.mak


# ===== 1.1) Intel iPSC/2 =====
#

#  MATLIBDIR is the directory that contains the matlib files
MATLIBDIR  =  /usr/hartman/matlib
OBJECTS    =  clargs.o comm.o hcube.o generate.o mat_ops.o matrixio.o memory.o math.o sep.o timing.o vec_ops.o


# ===== 1.2) Logical Systems C =====
#

T414LIBNAME=matlib4
T800LIBNAME=matlib8

TRL4FILES=clargs.trl4 comm.trl4 complex.trl4 generate.trl4 machine.trl4 mat_ops.trl4 math.trl4 matrixio.trl4 memory.trl4 num_sys.trl4 sep.trl4 timing.trl4 vec_ops.trl4
TRL8FILES=clargs.trl8 comm.trl8 complex.trl8 generate.trl8 machine.trl8 mat_ops.trl8 math.trl8 matrixio.trl8 memory.trl8 num_sys.trl8 sep.trl8 timing.trl8 vec_ops.trl8

TLIB4FILES=clargs comm complex generate machine mat_ops math matrixio memory num_sys sep timing vec_ops
TLIB8FILES=clargs comm complex generate machine mat_ops math matrixio memory num_sys sep timing vec_ops

PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800

TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8

TASMOPT=-ct

T2LIB=t2lib.tll
T4LIB=matlib4.tll t4cube.tll t4lib.tll
T8LIB=matlib8.tll t8cube.tll t8lib.tll

RENTRY=_main
NENTRY=_ns_main


# ======= 2.) INSTRUCTIONS FOR DEFAULT MAKE =======
#
#   The following sections give the default (since they appear first in the
#   makefile) options for this makefile.  By commenting one or the other
#   out, one can get to the defaults easily.
#
# =========================================================

ipsc:	imatlib
clean:	iclean
# tptr:	tmatlib
# clean:	tclean


# =============== 2.1) Intel iPSC/2 ===============
#

imatlib:	$(OBJECTS)


# ============ 2.2) Logical Systems C ============
#
#  Make everything and install in the library directory designated by the
#  environment variable TLIB.

tmatlib:
	make -f $(THISMAKEFILE) $(T414LIBNAME).tll
	make -f $(THISMAKEFILE) install4
	make -f $(THISMAKEFILE) tclean
	make -f $(THISMAKEFILE) $(T800LIBNAME).tll
	make -f $(THISMAKEFILE) install8
	make -f $(THISMAKEFILE) tclean
	make -f $(THISMAKEFILE) install_headers


# ===== CREATE T414 VERSION OF THE LIBRARY =====

$(T414LIBNAME).tll :	$(TRL4FILES)
	tlib $(T414LIBNAME) -b $(TLIB4FILES)

clargs.trl4 :	clargs.h clargs.c
	pp   clargs.c   $(PPOPT4)
	tcx  clargs.pp  $(TCXOPT4)
	tasm clargs.tal $(TASMOPT)

comm.trl4 :	comm.h comm.c
	pp   comm.c   $(PPOPT4)
	tcx  comm.pp  $(TCXOPT4)
	tasm comm.tal $(TASMOPT)

complex.trl4 :	complex.h complex.c
	pp   complex.c   $(PPOPT4)
	tcx  complex.pp  $(TCXOPT4)
	tasm complex.tal $(TASMOPT)

generate.trl4 :	generate.h generate.c matrix.h memory.trl4
	pp   generate.c   $(PPOPT4)
	tcx  generate.pp  $(TCXOPT4)
	tasm generate.tal $(TASMOPT)

hcube.trl4 :	hcube.h hcube.c
	pp   hcube.c   $(PPOPT4)
	tcx  hcube.pp  $(TCXOPT4)
	tasm hcube.tal $(TASMOPT)
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46 

47  machine. trl4    :    machine. h  machine. c 

48  pp        machine. c  $(PP0PT4) 

49  tcx     machine. pp  $(TCX0PT4) 

50  tasm  machine. tal  $(TASM0PT) 

51 

52  mat_ops.trl4    :    mat_ops.h  mat_ops.c  matrix. h 

53  pp        mat_ops.c  $(PP0PT4) 

54  tcx     *at_ops.pp  KTCX0PT4) 

55  tasm  mat_ops.tal  $(TASM0PT) 
56 

57  math.trl4    :    math.h  math.c 

56  pp       math.c  $(PP0PT4) 

59  tcx     math.pp  $(TCX0PT4) 

60  tasm  math. tal  l(TASMOPT) 
61 

62  matrixio. trl4    :   matrixio.h  matrixio.c   ascii.h  matrix. h  memory. trl4 

63  pp        matrixio.c  $(PP0PT4) 

64  tcx     matrixio. pp         $(TCX0PT4) 

65  tasm  matrixio. tal       $(TASM0PT) 

66 

67  memory. trl4  :  memory. h  memory. c  matrix. h 

66  pp   memory. c       $(PP0PT4) 

69  tcx     memory. pp  $(TCX0PT4) 

70  tasm  memory. tal  $(TASM0PT) 

71 

72  num_sys.trl4    :    num_sys.h  num_sys.c  matrix. h 

73  pp       num_sys.c  $(PP0PT4) 

74  tcx     num_sys.pp  $(TCX0PT4) 

75  tasm  num.sys.tal  t(TASMOPT) 

76 

77  sep.trl4    :    sep.h  sep.c 

76  pp        sep.c  $(PP0PT4) 
79  tcx     sep.pp  $(TCX0PT4) 
so  tasm  sep.tal  $(TASM0PT) 

61 

82  timing. trl4    :    timing. h  timing. c 
63  pp       timing. c  $(PP0PT4) 

84  tcx     timing. pp  $(TCX0PT4) 

85  tasm  timing. tal  $(TASM0PT) 

86 

87  vec_op8.trl4    :    vec_ops.h  vec_ops.c 

88  pp       vec_ops.c  $(PP0PT4) 

89  tcx      vec_ops.pp  $(TCX0PT4) 

90  tasm  vec.ops.tal  t(TASMOPT) 

91 
92 
# CREATE T800 VERSION OF THE LIBRARY

$(T800LIBNAME).tll : $(TRL8FILES)
	tlib $(T800LIBNAME) -b $(TLIB8FILES)

clargs.trl8 : clargs.h clargs.c
	pp clargs.c $(PPOPT8)
	tcx clargs.pp $(TCXOPT8)
	tasm clargs.tal $(TASMOPT)

comm.trl8 : comm.h comm.c
	pp comm.c $(PPOPT8)
	tcx comm.pp $(TCXOPT8)
	tasm comm.tal $(TASMOPT)

complex.trl8 : complex.h complex.c
	pp complex.c $(PPOPT8)
	tcx complex.pp $(TCXOPT8)
	tasm complex.tal $(TASMOPT)

generate.trl8 : generate.h generate.c matrix.h memory.trl8
	pp generate.c $(PPOPT8)
	tcx generate.pp $(TCXOPT8)
	tasm generate.tal $(TASMOPT)

hcube.trl8 : hcube.h hcube.c
	pp hcube.c $(PPOPT8)
	tcx hcube.pp $(TCXOPT8)
	tasm hcube.tal $(TASMOPT)

machine.trl8 : machine.h machine.c
	pp machine.c $(PPOPT8)
	tcx machine.pp $(TCXOPT8)
	tasm machine.tal $(TASMOPT)

mat_ops.trl8 : mat_ops.h mat_ops.c matrix.h
	pp mat_ops.c $(PPOPT8)
	tcx mat_ops.pp $(TCXOPT8)
	tasm mat_ops.tal $(TASMOPT)

math.trl8 : math.h math.c
	pp math.c $(PPOPT8)
	tcx math.pp $(TCXOPT8)
	tasm math.tal $(TASMOPT)

matrixio.trl8 : matrixio.h matrixio.c ascii.h matrix.h memory.trl8
	pp matrixio.c $(PPOPT8)
	tcx matrixio.pp $(TCXOPT8)
	tasm matrixio.tal $(TASMOPT)

memory.trl8 : memory.h memory.c matrix.h
	pp memory.c $(PPOPT8)
	tcx memory.pp $(TCXOPT8)
	tasm memory.tal $(TASMOPT)

num_sys.trl8 : num_sys.h num_sys.c matrix.h
	pp num_sys.c $(PPOPT8)
	tcx num_sys.pp $(TCXOPT8)
	tasm num_sys.tal $(TASMOPT)

sep.trl8 : sep.h sep.c
	pp sep.c $(PPOPT8)
	tcx sep.pp $(TCXOPT8)
	tasm sep.tal $(TASMOPT)

timing.trl8 : timing.h timing.c
	pp timing.c $(PPOPT8)
	tcx timing.pp $(TCXOPT8)
	tasm timing.tal $(TASMOPT)

vec_ops.trl8 : vec_ops.c vec_ops.h
	pp vec_ops.c $(PPOPT8)
	tcx vec_ops.pp $(TCXOPT8)
	tasm vec_ops.tal $(TASMOPT)


# COPY LIBRARIES TO TLIB DIRECTORY

install4:
	copy $(T414LIBNAME).tll $(TLIB)

install8:
	copy $(T800LIBNAME).tll $(TLIB)

# COPY HEADER FILES TO STANDARD INCLUDE DIRECTORY

install_headers :
	copy ascii.h    $(TLIB)\..\include
	copy macros.h   $(TLIB)\..\include
	copy matrix.h   $(TLIB)\..\include
	copy clargs.h   $(TLIB)\..\include
	copy comm.h     $(TLIB)\..\include
	copy complex.h  $(TLIB)\..\include
	copy generate.h $(TLIB)\..\include
	copy hcube.h    $(TLIB)\..\include
	copy machine.h  $(TLIB)\..\include
	copy mat_ops.h  $(TLIB)\..\include
	copy math.h     $(TLIB)\..\include
	copy matrixio.h $(TLIB)\..\include
	copy memory.h   $(TLIB)\..\include
	copy num_sys.h  $(TLIB)\..\include
	copy sep.h      $(TLIB)\..\include
	copy timing.h   $(TLIB)\..\include
	copy vec_ops.h  $(TLIB)\..\include


# ========   3.)  FILE MANAGEMENT & UTILITIES   =======
#
#   This section makes short work of a few useful/routine tasks
#

# ====   3.1)  Intel iPSC/2   ====
#

iclean:
	rm $(OBJECTS)

# ====   3.2)  Logical Systems C   ====
#

tclean:
	del *.pp
	del *.tal
	del *.trl

# EOF matlib.mak
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B.   NETWORK  INFORMATION  FILES 

hyprcube.nif  This  Network  Information  File  gives  a  fairly  complete  description  of 
the  hardware  configuration  used  to  perform  the  transputer  work. 
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1 



NETWORK INFORMATION FILE

SOURCE    hyprcube.nif
VERSION   1.1
DATE      09 September 1991
AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
USAGE     ld-net hyprcube
EDITING   replace 'rootcode' with the code to run on the root
          replace 'nodecode' with appropriate code(s) for the nodes

REFERENCES 


[1]  Inmos.  IMS B012 User Guide and Reference Manual.  Inmos Limited,
     1988, Fig. 26, p. 28.


DESCRIPTION 


Network Information File (NIF) used by the Logical Systems C (version 89.1)
LD-NET Network Loader.  This file prescribes the loading action to take
place when the 'ld-net' command is given as in USAGE above.


HARDWARE  PREREQUISITES 


NOTE:  There  are  three  node  numbering  systems:  the  one  created  by  Inmos' 
CHECK  program,  the  Gray  code  labeling,  and  the  NIF  labeling.  Since  all 
three  will  be  used  on  occasion,  I  will  prefix  node  numbers  with  a  C,  G, 
or  N  to  identify  which  system  I  am  using! 

The IMS B004 and IMS B012 must be configured correctly.  The B004's T414
has link 0 connected to the host PC via a serial-to-parallel converter,
link 1 connected to the IMS B012 PipeHead, link 2 connected to the T212
[communications manager (not used here)] on the B012, and link 3
connected to the IMS B012 PipeTail (see [1]).  By the way, link 2 from
the B004 goes to the ConfigUp slot just under the PipeHead slot
(this connects it to the T212).  Finally, the B004's Down link must run
to the B012's Up link.


-====   SETTING  THE  C004  CROSSBAR  SWITCHES 


Once you have connected the hardware in the fashion mentioned above,
the system is ready to be transformed to a hypercube.  Three codes by
Mike Esposito are used here:  t2.nif, root.tld, and switch.tld.  I also
have a batch file called 'makecube.bat' that performs an 'ld-net t2'.

Mike's code passes instructions to the T212 on the B012, which in turn
tells the C004's how to connect their switches.  After the code has
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executed,  the  (very  specific)  configuration  that  we  are  looking  for 
will  exist.   Specifically,  the  following  (output  from  CHECK  /R)  is  what 
this  process  gives  us: 


check 1.21
  # Part rate Mb Bt [  Link0  Link1  Link2  Link3 ]
  0 T414b-15 0.09  0 [   HOST    1:1    2:1    3:2 ]
  1 T800c-20 0.80  1 [    4:3    0:1    5:1    6:0 ]
  2 T2   -17 0.49  1 [   C004    0:2   C004 ]
  3 T800c-20 0.80  2 [    7:3    8:2    0:3    9:0 ]
  4 T800c-20 0.76  3 [    9:3   10:2   11:1    1:0 ]
  5 T800d-20 0.90  1 [    8:3    1:2   10:1   12:0 ]
  6 T800d-20 0.76  0 [    1:3   12:2    7:1   11:0 ]
  7 T800d-20 0.76  3 [   13:3    6:2   14:1    3:0 ]
  8 T800d-20 0.90  2 [   14:3   15:2    3:1    5:0 ]
  9 T800c-20 0.77  0 [    3:3   13:2   15:1    4:0 ]
 10 T800d-20 0.90  2 [   16:3    5:2    4:1   15:0 ]
 11 T800d-20 0.90  1 [    6:3    4:2   16:1   13:0 ]
 12 T800d-20 0.77  0 [    5:3   16:2    6:1   14:0 ]
 13 T800d-20 0.77  3 [   11:3   17:2    9:1    7:0 ]
 14 T800c-20 0.90  1 [   12:3    7:2   17:1    8:0 ]
 15 T800c-20 0.90  2 [   10:3    9:2    8:1   17:0 ]
 16 T800c-20 0.76  3 [   17:3   11:2   12:1   10:0 ]
 17 T800d-20 0.88  2 [   15:3   14:2   13:1   16:0 ]


Here node C0 is the root transputer (on the IMS B004) and node C2 is
the T212 (on the IMS B012).  The other sixteen nodes are the T800's
that are used for the work.  A logical interconnection topology is
described below.


TOPOLOGY 


The physical interconnection scheme described above is an actual 4-cube
with one exception.  The root node (C0) is situated BETWEEN nodes C1
and C3 (which would be connected directly in the usual 4-cube).  This
gives us two 3-cubes: one whose node labeling is G0xxx and the other,
whose node labeling is G1xxx (where the xxx represents all combinations
of 3 bits).  These are the usual 3-cubes, and they will exist if we
define the node numbering/labeling correctly.


STRATEGY 


The node labeling established by the NIF is available via the variable
_node_number (see <conc.h>) in source code.  Therefore, we would like a
smart labeling scheme in the NIF file so that programming is easier.
This, of course, is subject to the restriction that NIF labels begin
with N1 and so on.
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101 



One such method would be to define a NIF labeling so that the Gray code
label for a node would be (_node_number - 2).  In fact, this is
possible and the adjacencies defined below allow us to realize this
feature.  Below, node N0 is the host PC, node N1 is the root transputer
(T414 on the B004), N2 through N17 correspond to G0 through G15 (the
nodes of a 4-cube), and N18 is not used (but it's the T212).
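
The labeling rule above can be illustrated with a short C sketch. It is not part of hyprcube.nif or the library; the names gray_from_nif(), nif_from_gray(), cube_neighbor(), and print_nif_neighbors() are hypothetical, chosen only for this illustration. The sketch assumes an ideal 4-cube, so it lists N10 as a neighbor of N2 even though, in the actual hardware, that one edge passes through the root (N1).

```c
#include <assert.h>
#include <stdio.h>

#define CUBE_DIM   4   /* dimension of the full cube */
#define NIF_OFFSET 2   /* N2..N17 carry Gray code labels G0..G15 */

/* Convert a NIF node number (2..17) to its Gray code label (0..15). */
int gray_from_nif(int nif_number)
{
    assert(nif_number >= NIF_OFFSET);
    assert(nif_number < NIF_OFFSET + (1 << CUBE_DIM));
    return nif_number - NIF_OFFSET;
}

/* Convert a Gray code label (0..15) back to its NIF node number. */
int nif_from_gray(int gray_label)
{
    assert(gray_label >= 0 && gray_label < (1 << CUBE_DIM));
    return gray_label + NIF_OFFSET;
}

/* Neighbor of a node across dimension d: flip bit d of the label. */
int cube_neighbor(int gray_label, int d)
{
    assert(d >= 0 && d < CUBE_DIM);
    return gray_label ^ (1 << d);
}

/* Print the four ideal 4-cube neighbors of a NIF node, in NIF numbering. */
void print_nif_neighbors(int nif_number)
{
    int g = gray_from_nif(nif_number);
    int d;

    printf("N%d (G%d):", nif_number, g);
    for (d = 0; d < CUBE_DIM; d++)
        printf(" N%d", nif_from_gray(cube_neighbor(g, d)));
    printf("\n");
}
```

For example, print_nif_neighbors(3) lists N2, N5, N7, and N11, which is the same set of nodes that the link description for node 3 names.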


host.server cio.exe;     (default)

NODE  TRANSPUTER    RESET   DESCRIPTION OF LINK CONNECTIONS
ID    LOADABLE      COMES
      CODE (.tld)   FROM:   LINK0  LINK1  LINK2  LINK3

 1,   rootcode,     r0,     0,     2,     18,    10;     B004
 2,   nodecode,     r1,     4,     1,     3,     6;      B012
 3,   nodecode,     r2,     11,    2,     5,     7;
 4,   nodecode,     r5,     12,    5,     8,     2;
 5,   nodecode,     r3,     9,     3,     4,     13;
 6,   nodecode,     r7,     2,     7,     14,    8;
 7,   nodecode,     r9,     3,     9,     6,     15;
 8,   nodecode,     r4,     6,     4,     9,     16;
 9,   nodecode,     r8,     17,    8,     7,     5;
10,   nodecode,     r11,    14,    11,    1,     12;
11,   nodecode,     r13,    15,    13,    10,    3;
12,   nodecode,     r16,    10,    16,    13,    4;
13,   nodecode,     r12,    5,     12,    11,    17;
14,   nodecode,     r6,     16,    6,     15,    10;
15,   nodecode,     r14,    7,     14,    17,    11;
16,   nodecode,     r17,    8,     17,    12,    14;
17,   nodecode,     r15,    13,    15,    16,    9;
18,   switch,       s1,     ,      1,     ,      ;       T212

EOF hyprcube.nif
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C.   STANDARD  FILES 

macros.h  This header file gives several C macros that are used in other programs.
matrix.h  This header file establishes the standard definition of a matrix.


macros.h

/*  PROGRAM INFORMATION
 *
 *  SOURCE    macros.h
 *  VERSION   1.3
 *  DATE      14 September 1991
 *  AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 */

#define MAX(x,y)   (((x) > (y)) ? (x) : (y))
#define MIN(x,y)   (((x) > (y)) ? (y) : (x))
#define POW2(n)    ((1) << (n))

/*  EOF macros.h  */
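
A brief usage sketch for the three macros follows. The helper functions cube_size() and clamp() are hypothetical and do not appear in the library; they only show how the macros compose. Because MAX and MIN evaluate their arguments more than once, arguments with side effects (such as i++) should be avoided.

```c
#include <assert.h>

/* The three macros as they appear in macros.h. */
#define MAX(x,y)   (((x) > (y)) ? (x) : (y))
#define MIN(x,y)   (((x) > (y)) ? (y) : (x))
#define POW2(n)    ((1) << (n))

/* Hypothetical helper: number of nodes in a hypercube of dimension dim.
 * Assumes int is at least 32 bits wide. */
int cube_size(int dim)
{
    assert(dim >= 0 && dim < 31);   /* keep the shift within int range */
    return POW2(dim);
}

/* Hypothetical helper: clamp value into the closed interval [lo, hi]. */
int clamp(int value, int lo, int hi)
{
    assert(lo <= hi);
    return MAX(lo, MIN(value, hi));
}
```

With these definitions, cube_size(4) gives the sixteen nodes of the 4-cube, and clamp(20, 0, 15) restricts an out-of-range node label to 15.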
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matrix. h  

/* ==========   PROGRAM INFORMATION   ==========
 *
 *  SOURCE    matrix.h
 *  VERSION   2.0
 *  DATE      02 September 1991
 *  AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ==========   DESCRIPTION   ==========
 *
 *  A header file for a family of functions designed to work with matrices.
 *
 */

#include "complex.h"        /* for Complex_Type */

/* ==========   MANIFEST CONSTANTS   ========== */

#define BASE_TEN          10
#define CURRENT           1
#ifndef EXIT_FAILURE
#define EXIT_FAILURE      1
#endif
#ifndef EXIT_SUCCESS
#define EXIT_SUCCESS      0
#endif
#define FAILURE           1
#define FALSE             0
#define LINE_LENGTH       80
#define MAX_NAME_LENGTH   80
#define NO                0
#define OFF               0
#define ON                1
#define ONE_BYTE          1
#define ONE_MEMBER        1
#define PREVIOUS          0
#define SUCCESS           0
#define TRUE              1
#define TYPE_CHAR         0
#define TYPE_DOUBLE       1
#define TYPE_FLOAT        2
#define TYPE_INT          3
#define YES               1

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/* 


TYPE  DEFINITIONS 


*/ 


typedef struct {                /* default/standard is type double */
    char    *name;
    int      rows,
             cols;
    double **matrix;
} Matrix_Type;

typedef struct {                /* type Complex_Type is in complex.h */
    char          *name;
    int            rows,
                   cols;
    Complex_Type **matrix;
} Complex_Matrix_Type;

typedef struct {
    char    *name;
    int      rows,
             cols;
    double **matrix;
} Double_Matrix_Type;

typedef struct {
    char    *name;
    int      rows,
             cols;
    float  **matrix;
} Float_Matrix_Type;

typedef struct {
    char    *name;
    int      rows,
             cols;
    int    **matrix;
} Int_Matrix_Type;

/* ============   EOF matrix.h   ============ */
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D.   SOURCE  CODE  FILES 

There  is  one  header  file  and  one  (.c)  source  code  file  for  each  remaining  member 
of  the  library,  so  the  filename  is  given  without  the  suffix. 

allocate  Memory  allocation  and  management  functions. 

clargs  For  processing  command-line  arguments. 

comm  Communications  functions  for  the  hypercubes. 

complex  Complex  numbers  and  operations. 

epsilon  Machine  precision  functions. 

generate  Matrix  generation  functions. 

io  Input/output (I/O) functions.

mathx  A  small  extension  to  the  C  math  library. 

num_sys  Various  number  systems  (binary,  decimal,  hexadecimal). 

ops  Matrix  and  vector  operations. 

timing  Functions  for  timing. 

Again, however, most of the source code has been omitted and only the header
files remain.  The single exception is complex.c because this source contains an
algorithm referenced earlier in the thesis.
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PROGRAM     INFORMATION 


SOURCE    allocate.h
VERSION   2.0
DATE      09 September 1991
AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School


====____=-=_==    DESCRIPTION    ============= 

Declarations  of  functions  associated  with  memory  allocation. 


LIST  OF  FUNCTIONS 


cmatalloc()
intvecalloc()
matalloc()


FUNCTION  DECLARATION 


PURPOSE:    This function performs the memory allocation for a matrix
            structure (of the Complex_Matrix_Type) using the C function
            calloc().  Additionally, it fills the "rows" and "cols"
            fields of the matrix structure returned with the parameters
            passed to the function.  If a structure is returned (see
            "RETURNS"), then its "rows" and "cols" fields will be
            filled with the correct values.  The structure type is
            defined in "matrix.h".

INCLUDE:    "allocate.h"

CALLS:      calloc()

CALLED BY:

PARAMETERS: int rows    the number of rows in the desired matrix
            int cols    the number of columns in the desired matrix

RETURNS:    A pointer to the structure if successful; NULL otherwise.
            The NULL case includes non-positive rows or cols in addition
            to the obvious allocation failure.
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EXAMPLE:     Complex_Matrix_Type  *A; 
A  =  cmatalloc(7,  7); 


*/
#ifdef PROTOTYPE

Complex_Matrix_Type *cmatalloc(int rows, int cols);
#else

Complex_Matrix_Type *cmatalloc();
#endif


FUNCTION  DECLARATION 


PURPOSE:    This function performs the memory allocation for a vector,
            v, of num_elements integer elements.

INCLUDE:    "allocate.h"

CALLS:      calloc()

CALLED BY:

PARAMETERS: See PURPOSE.

RETURNS:    A pointer to the array if successful and NULL otherwise.

EXAMPLE:    int desired_size_of_v = 7,
                *v;

            v = intvecalloc(desired_size_of_v);

*/

#ifdef PROTOTYPE

int *intvecalloc(int num_elements);
#else
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int  *intvecalloc() ; 
#endif 


FUNCTION  DECLARATION 


PURPOSE:    This function performs the memory allocation for a matrix
            structure using the C function calloc().  Additionally, it
            fills the "rows" and "cols" fields of the matrix structure
            returned with the parameters passed in to the function.
            If a structure is returned (see "RETURNS"), then its "rows"
            and "cols" fields will be filled with the correct values.
            The structure type is defined in "matrix.h".

INCLUDE:    "allocate.h"

CALLS:      calloc()

CALLED BY:

PARAMETERS: int rows    the number of rows in the desired matrix
            int cols    the number of columns in the desired matrix

RETURNS:    A pointer to the structure if successful; NULL otherwise.
            The NULL case includes non-positive rows or cols in addition
            to the obvious allocation failure.

EXAMPLE:    Double_Matrix_Type *A = matalloc(7, 7);

*/
#ifdef PROTOTYPE

Double_Matrix_Type *matalloc(int rows, int cols);
#else

Double_Matrix_Type *matalloc();
#endif


/* 


EOF  allocate. h 


*/ 
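
Since the source for these allocation functions is omitted, the following is only a sketch of how a function with the behavior documented for matalloc() might be written. The names matalloc_sketch() and matfree_sketch() are hypothetical, the row-by-row layout is an assumption, and the structure is repeated here only so that the fragment stands alone.

```c
#include <stdlib.h>

/* Mirrors Double_Matrix_Type from matrix.h. */
typedef struct {
    char    *name;
    int      rows,
             cols;
    double **matrix;
} Double_Matrix_Type;

/* Hypothetical companion: release everything the sketch allocated. */
void matfree_sketch(Double_Matrix_Type *A)
{
    int i;

    if (A == NULL)
        return;
    if (A->matrix != NULL) {
        for (i = 0; i < A->rows; i++)
            free(A->matrix[i]);     /* free(NULL) is harmless */
        free(A->matrix);
    }
    free(A);
}

/* Zero-filled rows x cols matrix; NULL for bad sizes or failed allocation. */
Double_Matrix_Type *matalloc_sketch(int rows, int cols)
{
    Double_Matrix_Type *A;
    int i;

    if (rows <= 0 || cols <= 0)
        return NULL;

    A = calloc(1, sizeof *A);
    if (A == NULL)
        return NULL;

    A->rows = rows;
    A->cols = cols;
    A->matrix = calloc((size_t) rows, sizeof *A->matrix);
    if (A->matrix == NULL) {
        free(A);
        return NULL;
    }
    for (i = 0; i < rows; i++) {
        A->matrix[i] = calloc((size_t) cols, sizeof **A->matrix);
        if (A->matrix[i] == NULL) {
            matfree_sketch(A);      /* also frees the rows made so far */
            return NULL;
        }
    }
    return A;
}
```

The sketch returns NULL in the same cases the documentation lists: non-positive rows or cols, and allocation failure at any stage.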




clargs.h

/* ==========   PROGRAM INFORMATION   ==========
 *
 *  SOURCE    clargs.h
 *  VERSION   1.5
 *  DATE      09 September 1991
 *  AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ==========   DESCRIPTION   ==========
 *
 * This header file gives the declarations to accompany clargs.c.  These
 * files provide a standard (if somewhat limited) way of handling
 * command-line arguments.  The objective is to handle:
 *
 * 1.) Simple boolean arguments like "if -v exists, set verbose = TRUE".
 *     We will call such an argument a 'simple' argument type.  This
 *     type of argument can be recognized by the fact that it has no
 *     sub-arguments (the sub-argument count, subargc == 0).
 *
 * 2.) Arguments with sub-arguments to be interpreted as numbers.  We
 *     will call this a 'complex' argument type.  Suppose that we want to
 *     set int dim = 3 when the command line arguments contain "-d 3".
 *     This case implies several requirements:
 *
 *     a.) First, we must know in advance how many sub-arguments the
 *         argument has; we'll call this subargc (in this case we are
 *         expecting one sub-argument, so the caller would have set
 *         subargc = 1).
 *
 *     b.) Secondly, we must know how to interpret each sub-argument
 *         [i.e., what type is the sub-argument?  Is it a double or long
 *         (float and int can be handled by type casting)?]
 *
 *     We will call this kind of argument a complex argument type.  They
 *     can be recognized as those with subargc > 0.
 *
 * Here is the strategy.  The user makes a list of valid command-line
 * arguments by creating an array of pointers to structures of type
 * Arg_Struct.  We will call this the option list, (Arg_Struct *) optv[].
 * The code assumes that you can do something like this at the top of your
 * source:
 *
 *     #define MAX_NUMBER_OF_ARGS 3
 *
 *     static Arg_Struct *optv[MAX_NUMBER_OF_ARGS];
 *
 * Let (int) optc be the option count (number of options).  Every element
 * in (pointed to by) the option list is a structure of type Arg_Struct
 * defined below.  By using the standard C argc and argv, and by creating
 * and passing optc and optv around, we can manipulate command-line
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arguments  just  about  however  we  want.   The  next  step  is  to  understand 
the  structure. 


LIST OF FUNCTIONS

install_complex_arg()
install_simple_arg()
interpret_args()


/♦ 


MANIFEST  CONSTANTS 


*/ 


#ifndef 

#def ine 

#endif 

#ifndef 

#def ine 

#endif 

#ifndef 

#def ine 

#endif 

tifndef 

#def ine 

#endif 

#ifndef 

#def ine 

#endif 

#ifndef 

#def ine 

#endif 


EXIT.FAILURE 
EXIT.FAILURE 

EXIT.SUCCESS 
EXIT.SUCCESS 

FALSE 
FALSE 

HULL 
NULL 

SUCCESS 
SUCCESS 

TRUE 
TRUE 


/* 

* 
* 

*/ 


The maximum number of characters in an argument name, MAX_ARGLEN, is a
relatively arbitrary thing... make it whatever you want.  The DOUBLE
and LONG manifest constants are assumed to be used for values of
subargi (see the structure below).


#define MAX_ARGLEN
#define DOUBLE
#define LONG
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DATA  STRUCTURES 


argname     The (string) name of a valid argument.  For instance, if
            you want the simple argument "-v", then argname[] would be
            "-v".  If you have a complex argument that will appear as
            "-number 3 4.5 6.7", then argname will be "-number" and you
            must use the sub-argument variables below to handle the
            integer and two floating-point values.

subargc     Consider the "-number" example again.  There are three
            sub-arguments (3, 4.5, and 6.7) so the sub-argument count
            would be 3.

subargi[]   This array tells us how to interpret the sub-arguments.  For
            instance, again using the "-number" example above, we would
            set subargi[0] = LONG; subargi[1] = DOUBLE; and
            subargi[2] = DOUBLE.

found       This should be initialized to FALSE.  The function
            interpret_args() will set this field TRUE if the argname[]
            appears on the command line (in *argv[]).

dsa[]       This field is an array of double sub-arguments.

lsa[]       This field is an array of long sub-arguments.

Consider the "-number" example again.  After argument resolution, we
would find that dsa[0] is not defined since subargi[0] == LONG.
However, we can use subargi[] to verify that subargi[1] and subargi[2]
are DOUBLE.  Knowing this, we can safely presume that the values with
CORRESPONDING index in dsa[] should be interpreted as doubles.  That
is, dsa[1] will be a double value (4.5) and dsa[2] will also be a
double (6.7).  In a similar manner, lsa[0] must be a long (3) and
lsa[1] and lsa[2] are not defined.


typedef struct {
    char     argname[MAX_ARGLEN];
    int      subargc,       /* how many subarguments expected */
            *subargi,       /* how to interpret subarguments */
             found;         /* set TRUE if the argument is found */
    double  *dsa;           /* double-valued sub-arguments */
    long    *lsa;           /* long-valued sub-argument list */
} Arg_Struct;
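
To make the "-number 3 4.5 6.7" example concrete, here is a minimal sketch of how one option could be resolved against argv. It is not the code from clargs.c, which is omitted; resolve_option_sketch() is a hypothetical name, and the values given to MAX_ARGLEN, DOUBLE, and LONG are assumptions made only so that the fragment compiles.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ARGLEN 32   /* assumed value; the header leaves it to the user */
#define DOUBLE     0    /* assumed encoding for subargi[] entries */
#define LONG       1    /* assumed encoding for subargi[] entries */
#define FALSE      0
#define TRUE       1

typedef struct {
    char     argname[MAX_ARGLEN];
    int      subargc, *subargi, found;
    double  *dsa;
    long    *lsa;
} Arg_Struct;

/* Hypothetical helper: look for opt->argname in argv and convert the
 * sub-arguments that follow it.  Returns TRUE if the argument was found
 * with all of its sub-arguments present, FALSE otherwise. */
int resolve_option_sketch(Arg_Struct *opt, int argc, char **argv)
{
    int i, k;

    opt->found = FALSE;
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], opt->argname) != 0)
            continue;
        if (i + opt->subargc >= argc)
            return FALSE;               /* too few sub-arguments follow */
        for (k = 0; k < opt->subargc; k++) {
            const char *text = argv[i + 1 + k];
            if (opt->subargi[k] == DOUBLE)
                opt->dsa[k] = strtod(text, NULL);   /* fills dsa[k] only */
            else
                opt->lsa[k] = strtol(text, NULL, 10); /* fills lsa[k] only */
        }
        opt->found = TRUE;
        return TRUE;
    }
    return FALSE;
}
```

As in the description above, each sub-argument lands in the array selected by subargi[k], at the CORRESPONDING index k; the entry with the same index in the other array is left untouched.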
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FUNCTION  DECLARATION 


PURPOSE:    To install a valid complex argument in the option list,
            optv[].

INCLUDE:    "clargs.h"

CALLS:      strcpy()

CALLED BY:

PARAMETERS: int         index;
            Arg_Struct *optv[];
            const char *argname;
            int        *interpret,
                        subargc;

The first three parameters are exactly like the corresponding ones for
install_simple_arg().  Additionally, for complex arguments, we need to
pass in instructions concerning how many sub-arguments there are (i.e.,
subargc) and how to interpret each.  The array interpret[] should be
filled with subargc elements when you call this function.  The elements
should only be valid ones (e.g., DOUBLE, LONG).


#ifdef  PROTOTYPE 

void  install_complex_arg(int  index,  Arg_Struct  *optv[], 

const  char  *argname,  int  *interpret, 
int  subargc) ; 
#else 

void  install_complex_arg() ; 

#endif 


/*
 *  FUNCTION DECLARATION
 *
 *  PURPOSE:    To install a valid simple argument in the option list,
 *              optv[].
 *
 *  INCLUDE:    "clargs.h"
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CALLS:  strcpyO 

CALLED  BY: 

PARAMETERS:  int        index; 

Arg_Struct  *optv[]; 
const  char  *argname; 

The  'index'  gives  the  location  of  the  option  in  the  option  list, 
optv[].   The  function  uses  this  index  to  install  the  argname  at  the 
proper  location  in  optv[].   For  instance,  set  this  variable  to  zero  for 
the  first  option  in  the  list.   Normal  C  indexing  convention  applies; 
namely,  0  <=  index  <  MAX_NUMBER_OF_ARGS .   The  'argname'  is  the  string 
that  you  want  recognized  as  a  valid  argument.   For  instance,  suppose 
that  you  want  a  timing  argument  to  be  recognized  whenever  "-t"  appears 
on  the  command  line.   Then  you  would  supply  "-t"  in  this  place. 


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216 

#ifdef PROTOTYPE

void install_simple_arg(int index, Arg_Struct *optv[],
                        const char *argname);

#else

void install_simple_arg();

#endif

/*
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250 
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FUNCTION  DECLARATION 


PURPOSE:    Once the user has defined an appropriate option list,
            optv[], with optc options, this function parses the
command-line arguments (as given by argc and argv) and fills the
*optv[] structures appropriately.  For instance, every valid (exists in
optv ==> valid) argument that appears on the command line will result
in the corresponding optv structure's 'found' field being set to TRUE.
The function also interprets sub-arguments and fills dsa[] and/or lsa[]
accordingly.  It assumes that the caller has established the desired
argname's, subargc's, and subargi's.

INCLUDE:    "clargs.h"

CALLS:      printf()
            strcmp()
            strtod()
            strtol()

CALLED BY:

PARAMETERS: As described in PURPOSE.


*/

#ifdef PROTOTYPE

void interpret_args(int argc, char **argv, int optc, Arg_Struct **optv);

#else

void interpret_args();

#endif

/* =============   EOF clargs.h   ============== */
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PROGRAM  INFORMATION 


SOURCE    comm.h
VERSION   2.5
DATE      14 September 1991
AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School


DESCRIPTION 


This  header  file  gives  manifest  constants  and  function  specifications 
for  comm.c.   These  files  contain  communication  (and  related)  functions 
for  a  normal  hypercube  topology  and  a  hybrid  topology.   Unfortunately 
the  code  is  a  bit  busy  with  #ifdef 's,  but  the  purpose  of  these  files  is 
to  make  hypercubes  a  little  more  transparent.   This  makes  the  comm.h 
and  comm.c  files  a  bit  hard  to  read,  but  you  should  be  able  to  recoup 
this  loss  when  it  comes  time  to  write  a  particular  application. 


TOPOLOGIES 


The  functions  specified  below  have  been  designed  to  work  on  three  very 
different  machines.   First,  the  Intel  iPSC/2  with  a  normal  hypercube  of 
order  0,  1,  2,  or  3  is  handled.   A  normal  hypercube  of  transputers  is 
next  on  the  list  (also  order  0,  1,  2,  or  3).   Finally,  there  is  a 
hybrid  topology  of  transputers  that  is  handled.   The  normal  hypercubes 
need  almost  no  introduction.   We  have  a  host  or  root  processor/program 
together  with  programs  running  on  the  nodes.   I  will  use  host  and  root 
interchangeably  here,  although  'host'  is  properly  associated  with  the 
Intel  machine  and  'root'  is  the  more  correct/descriptive  term  when  the 
subject  is  transputer  networks.   The  hybrid  topology  deserves  a  more 
careful  introduction. 

The hybrid topology is a network of Inmos transputers (PC host with an
IMS B004 board and a T414 linked to sixteen T800 processors on an IMS
B012 board) arranged so that the 'root' is situated between nodes zero
and eight of a 4-cube.  This means that nodes 0 and 8 are NOT directly
connected.  The functions made for this topology compensate for this
situation.  Instead of trying to describe each function, I will simply
remark that the most natural way to treat this problem is (more or
less) as two 3-cubes attached to the root.  A more careful description
of how each problem is handled may be found in the code for the
particular function.
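
The adjacency rule of the hybrid topology can be stated in a few lines of C. This is only an illustrative sketch, not the code from comm.c: the names carry a _sketch suffix to distinguish them from the library functions, and node labels are assumed to be the Gray code labels 0 through 15.

```c
#include <assert.h>

#define HYBRID_DIM 4    /* the hybrid topology is built on a 4-cube */

/* Number of bit positions in which two node labels differ. */
int hamming_distance_sketch(int a, int b)
{
    int diff = a ^ b;
    int count = 0;

    assert(a >= 0 && b >= 0);
    while (diff != 0) {
        count += diff & 1;
        diff >>= 1;
    }
    return count;
}

/* 1 if nodes a and b share a direct link in the hybrid topology:
 * neighbors in the 4-cube, except the 0-8 pair, where the root sits. */
int directly_linked_sketch(int a, int b)
{
    assert(a >= 0 && a < (1 << HYBRID_DIM));
    assert(b >= 0 && b < (1 << HYBRID_DIM));
    if (hamming_distance_sketch(a, b) != 1)
        return 0;
    if ((a == 0 && b == 8) || (a == 8 && b == 0))
        return 0;       /* this edge passes through the root instead */
    return 1;
}

/* 1 if a and b lie in the same 3-cube (same most significant bit). */
int same_subcube_sketch(int a, int b)
{
    return ((a ^ b) & (1 << (HYBRID_DIM - 1))) == 0;
}
```

Under this view, a message between nodes in different 3-cubes either crosses one of the seven remaining dimension-3 links or travels through the root by way of node 0 or node 8.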

In summary, the transputer portions of the code depend upon: (1) a very
specific hardware configuration, (2) the appropriate NIF file to
support the usual Gray code in a convenient way

[ mynode() == _node_number - 2 ],
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and  (3)  a  particular  link  arrangement  like  that  can  be  created  by  Mike 
Esposito's  t2.nif,  root.tld,  and  switch. tld. 

DETAILS:  Look for additional details in hyprcube.nif.


PREREQUISITES 


Before using any of the functions involving send() or receive(), the
host (or root) program must call initialize_hypercube().  For transputer
applications, EACH of the NODES must call initialize_hypercube() too,
and you need to be sure that a hypercube exists in hardware and that
your NIF describes a hypercube with the usual Gray code.  You must define
the global variables {Channel *ic[], *oc[];} because the code depends
upon their existence.  Both of these vectors must be of length
(cubesize+1) as described in the preface to initialize_hypercube().

The cubesize and dimension that you use with the transputer
implementation determine the cube.  Even though you actually have sixteen
T800's in the cube, the cubesize and dimension that you use will
determine the portion that actually gets used.  Note that both the usual
hypercube and the hybrid 4-cube are built upon the same hardware and link
setup.  Many of the functions declared below DEPEND upon the proper call
to the initialize_hypercube() function.  To avoid difficulty, observe the
guidelines given with this function!  Additionally, in the transputer
case, you will need to make sure that you include <conc.h>.


LIST OF FUNCTIONS

coalesce()
cubecast()
cubecast_from()
directional_exchange()
directional_receive()
directional_send()
hamming_distance()
initialize_hypercube()
least_dimension()
link_number()
linkin()
linkout()
receive()
send()
submit()


/* =========  MACROS & MANIFEST CONSTANTS  ========= */

#ifdef TRANSPUTER

#define myhost()  -1
#define mynode()  (_node_number - 2)     /* depends upon <conc.h> */

#else   /* iPSC/2 */

#define ALL_NODES         -1
#define ALL_PIDS          -1
#define ANY_NODE           0    /* for receive(from any node, ... ) */
#define ANY_TYPE          -1    /* first non-force-type message */
#define ARBITRARY_TYPE     0    /* don't care */
#define KEEP_TIL_RELCUBE   1    /* for getcube() */
#define NODE_PID           0    /* arbitrary ... don't care */

#ifndef NULL
#define NULL 0
#endif

#endif

#ifndef FALSE
#define FALSE 0
#endif

#ifndef TRUE
#define TRUE 1
#endif


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    This function performs the first step in the opposite of
the cubecast() function. That is, this one is used when
you want to collect information from the nodes in 'higher dimensions'
of the hypercube at the current node. You may want to perform some work
before forwarding this information down to the next lower dimension, so
the submit() function is given separately.

Like the other functions in this file, coalesce() performs a somewhat
different task when executed in the hybrid 4-cube, so first we will
discuss the usual hypercubes. coalesce() is a null operation when
called from the highest dimension [ if least_dimension(node) is
equal to dim ]. Otherwise it performs the communication to receive
from higher dimensions (i.e., neighbors with larger node numbers). If
it is called from the host/root, it attempts to receive() from node
zero.

The coalesce() and submit() functions must be balanced properly across
the nodes. The CALLER must take the necessary steps to be sure that
buf is large enough to hold ((dim - least_dimension(node)) * len)
bytes. That is, there will be (dim - least_dimension(node)) copies of
the message accumulated at the calling node.

There are several exceptions in the hybrid 4-cube topology. Since the
root is connected to nodes 0000 and 1000, it must make sure that buf
can hold 2 copies of length len. Then you should think of nodes 0xxx
as one 3-cube and nodes 1xxx as another (more-or-less separate) 3-cube.
That is, there will be no exchanges in the 1xxx direction between them.
To determine the size of buf at any node, use the following formulae:

    (3 - least_dimension(node)) * len,         Nodes 0xxx

    (3 - least_dimension(node - 8)) * len,     Nodes 1xxx


CAUTIONS:    If you fail to allocate enough space for buf, the incoming
messages will overrun it and your program may fail unpredictably.

The transputer implementation depends upon the parameter
'type' being set equal to cubesize.

PREREQUISITE:   initialize_hypercube() 


INCLUDE:    <conc.h>             (Logical Systems C, version 89.1)
            "comm.h"

CALLS:      least_dimension()
            myhost()             (macro given above)
            pow2()               "mathx.h"
            receive()


CALLED  BY: 

EXAMPLE:    Suppose we are 'at' node 0 and we want to coalesce() copies
of some object from all of the appropriate nodes. Let the
object be of size 'len' bytes. For concreteness, let the topology be a
hypercube of order 3 (i.e., dim == 3). We would allocate a large enough
buf to hold (dim * len) bytes, since least_dimension(0) == 0. That is,
node 0 will be receiving from all neighbors whose least_dimension() is
greater [in this case, that is ALL of its neighbors]; namely, 1, 2, and
4. After the call, we would find the data from node 1 in the first len
bytes of buf; the data from 2 in the middle len bytes of buf; and the
data from 4 in the final len bytes of buf. The function is treated as
a multiple receive(), in increasing origin order, from the appropriate
neighbors.

PARAMETERS:

int   node   the coalesce()ing (receiving) node
int   dim    the dimension of the hypercube
char  *buf   a pointer to the beginning of the buffer where you want
             the message placed
long  len    the number of bytes to be received from EACH node in
             the next higher dimension that will be submit()ing
long  type   the type of the message (iPSC/2 applications only), or
             cubesize in the transputer case

*/

#ifdef PROTOTYPE
void coalesce(int node, int dim, char *buf, long len, long type);
#else
void coalesce(/* int node, int dim, char *buf, long len, long type */);
#endif
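
The buffer-size rules quoted for coalesce() can be checked with a small helper. The following is only a sketch: ld_standin() is a stand-in for least_dimension() (assumed here to return the position of the highest set bit, which matches the values listed with that function), and the names coalesce_bytes(), coalesce_bytes_hybrid(), and the is_root flag are hypothetical, not part of comm.h.

```c
#include <assert.h>

/* Stand-in for least_dimension(): position of the highest set bit,
 * counted from 1, with 0 for node 0 (an assumption of this sketch). */
static int ld_standin(int node)
{
    int d = 0;

    assert(node >= 0);
    while (node > 0) {
        node >>= 1;
        d++;
    }
    return d;
}

/* Bytes that buf must hold for coalesce() on an ordinary dim-cube. */
long coalesce_bytes(int node, int dim, long len)
{
    return (long)(dim - ld_standin(node)) * len;
}

/* Bytes that buf must hold for coalesce() on the hybrid 4-cube.
 * is_root is a hypothetical flag marking the root transputer. */
long coalesce_bytes_hybrid(int node, int is_root, long len)
{
    if (is_root)
        return 2 * len;                                /* nodes 0 and 8 */
    if (node < 8)
        return (long)(3 - ld_standin(node)) * len;     /* nodes 0xxx */
    return (long)(3 - ld_standin(node - 8)) * len;     /* nodes 1xxx */
}
```

For the 3-cube of the EXAMPLE, coalesce_bytes(0, 3, len) gives 3 * len, one slot each for nodes 1, 2, and 4.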


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    This function is called from the root/host and all nodes to
execute a broadcast to all p nodes. The host/root sends to
node zero to start the process off. Let lg(n) denote log_2(n). This
function performs the communication in lg(p) steps. For instance, node
zero receives from the host in what we'll call stage zero. Then, in
stage 1, node 0 passes the message to node 1. In stage 2, node 0 sends
the message to node 2 and node 1 sends it to node 3. In stage three,
nodes 0, 1, 2, and 3 each send the message to nodes 4, 5, 6, and 7
(respectively).

Then, in general, in stage i, the message moves into the ith dimension.
If you prefer, you can think of a pointer starting (after the message
arrives at node 0) at the rightmost bit (LSB) and indicating the
direction for the next transmission. The pointer moves left until it
reaches the MSB. This is the final stage of the cubecast().

The hybrid 4-cube is implemented by sending the message from the root
to nodes 0 and 8 first. Then node 0 performs the usual cubecast for
the nodes that appear in the usual 3-cube. Node 8 mirrors this action,
filling the other three-cube with labels like 1xxx.

In all cases, buf is filled with an initial receive() from the proper
node, and then it is used in retransmissions to other nodes. In any
event, buf holds the message after execution.


CAUTION:     The transputer implementation depends upon the parameter
'type' being set equal to cubesize.

PREREQUISITE:   initialize_hypercube()

INCLUDE:     <conc.h>             (Logical Systems C, version 89.1)
             "comm.h"

CALLS:       least_dimension()
             MIX)                 (macro from macros.h)
             myhost()             (macro from above)
             pow2()               "mathx.h"
             receive()
             send()

CALLED BY:

PARAMETERS:

int   node   the sending node
int   dim    the dimension of the hypercube
char  *buf   a pointer to the head of the message
long  len    the number of bytes to be passed
long  type   the type of the message (iPSC/2 applications only), or
             cubesize in the transputer case

*/

#ifdef PROTOTYPE
void cubecast(int node, int dim, char *buf, long len, long type);
#else
void cubecast(/* int node, int dim, char *buf, long len, long type */);
#endif
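
The stage-by-stage description of cubecast() on an ordinary hypercube can be written down as a one-line rule. The helper below is a sketch of that schedule only (it moves no data), and the name cubecast_destination() is hypothetical rather than part of comm.h.

```c
#include <assert.h>

/* Destination of `node` in `stage` (1..dim) of a cubecast(), or -1 when
 * the node does not yet hold the message and therefore stays silent.
 * In stage i the holders are nodes 0 .. 2^(i-1) - 1, and each of them
 * sends across bit (i - 1). */
int cubecast_destination(int node, int dim, int stage)
{
    int bit;

    assert(dim >= 1 && stage >= 1 && stage <= dim);
    assert(node >= 0 && node < (1 << dim));

    bit = 1 << (stage - 1);
    if (node >= bit)
        return -1;
    return node + bit;
}
```

With dim == 3 this reproduces the narrative: node 0 sends to 1, then 0 and 1 send to 2 and 3, then 0 through 3 send to 4 through 7.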


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    This function is similar to cubecast() but more general.
Here we do not assume that the message starts at the host
or at node zero; it may start at any general source node, src. In fact,
it may NOT be called from the root/host (use cubecast() in that case).

If dim is the order of the hypercube, then src goes through dim stages,
passing the message to its neighbors. The sequence is defined by an
XOR operation that starts at bit 1 of src and moves up through bit dim.
For instance, suppose src == 5 == 101b in the 3-cube (dim == 3). Then
src will first send to (101 XOR 001) == node 4, next to (101 XOR 010)
== node 7, and finally to (101 XOR 100) == node 1. Meanwhile, any time
that a non-source node gets the message, it begins the same process,
but only picks it up at the appropriate stage (the one after the stage
in which it received the message).

PREREQUISITE:   initialize_hypercube()

INCLUDE:    <conc.h>  (Logical Systems C, version 89.1)
            "comm.h"

CALLS:      directional_receive()
            directional_send()
            free()
            least_dimension()
            malloc()
            pow2()  "mathx.h"
            receive()
            send()
            sizeof()

CALLED BY:

PARAMETERS:

int   src    the source
int   node   the number of the node calling this function
int   dim    the dimension of the hypercube
char  *buf   a pointer to the head of the message
long  len    the number of bytes to be passed

*/

#ifdef PROTOTYPE
void cubecast_from(int src, int node, int dim, char *buf, long len);
#else
void cubecast_from();
#endif
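
The XOR sequence described for cubecast_from() can be enumerated directly. This sketch lists only the destinations a node would send to; it assumes that a non-source node receives in the stage given by the highest set bit of (node XOR src), which is consistent with the src == 5 example. The name cubecast_from_destinations() is hypothetical.

```c
#include <assert.h>

/* Fills dest[] with the nodes that `node` sends to during a
 * cubecast_from(src, ...) and returns how many there are.
 * dest[] must have room for dim entries. */
int cubecast_from_destinations(int src, int node, int dim, int dest[])
{
    int rel = node ^ src;   /* address of node relative to the source */
    int first = 0;          /* stage in which node received (0 for src) */
    int stage, n = 0;

    assert(dim >= 1 && src >= 0 && node >= 0);
    assert(src < (1 << dim) && node < (1 << dim));

    while (rel > 0) {       /* highest set bit of rel, counted from 1 */
        rel >>= 1;
        first++;
    }
    /* Send in every stage after the one in which the message arrived. */
    for (stage = first + 1; stage <= dim; stage++)
        dest[n++] = node ^ (1 << (stage - 1));
    return n;
}
```

For src == 5 in the 3-cube, the source itself gets the destinations 4, 7, 1, and node 4 (which received in stage 1) gets 6 and 0.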


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    To perform an exchange along a prescribed direction. The
direction is given as an integer in {1, 2, 4, 8, ..., 2^(dim-1)}.
This is because the direction is really a bit mask for the Gray-coded
node numbers. For instance, if you perform a directional_exchange()
from node == 3 == 011 in the 3-cube along direction == 4 == 100, this
is the same as performing a coordinated send() and receive() combination
with node (011 XOR 100 == 111 == 7). Care is taken to make sure
that deadlock does not occur.


PREREQUISITE:   initialize_hypercube()

INCLUDE:     <conc.h>      (Logical Systems C, version 89.1)
             "comm.h"

CALLS:       pow2()        "mathx.h"
             receive()
             send()

CALLED BY:

PARAMETERS:

int   node        the number of the node calling this function
int   dim         the dimension of the hypercube
int   direction   as described above (1, 2, 4, 8, etc.)
char  *ibuf       a pointer to the head of the incoming message
char  *obuf       a pointer to the head of the outgoing message
long  len         the number of bytes to be passed

*/

#ifdef PROTOTYPE
void directional_exchange(int node, int dim, int direction,
                          char *ibuf, char *obuf, long len);
#else
void directional_exchange();
#endif
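
The partner in a directional_exchange() is simply the node number with the direction bit flipped. The declaration says only that care is taken to avoid deadlock; one conventional way to do so with blocking channels, shown here purely as an assumption and not as the author's method, is to let the node whose direction bit is clear send first while its partner receives first. Both helper names are hypothetical.

```c
#include <assert.h>

/* Neighbor reached from `node` along `direction` (a one-bit mask). */
int exchange_partner(int node, int direction)
{
    assert(direction > 0 && (direction & (direction - 1)) == 0);
    return node ^ direction;
}

/* Nonzero if `node` should send before it receives. Exactly one node of
 * every exchanging pair answers "yes", so the two blocking operations
 * always pair up (assumed ordering rule, not taken from the source). */
int sends_first(int node, int direction)
{
    assert(direction > 0 && (direction & (direction - 1)) == 0);
    return (node & direction) == 0;
}
```

For node 3 and direction 4 the partner is node 7; node 3 would send first and node 7 would receive first.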


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    To receive from a prescribed direction. The direction is
as described in directional_exchange() above.

PREREQUISITE:   initialize_hypercube()

INCLUDE:     <conc.h>      (Logical Systems C, version 89.1)
             "comm.h"

CALLS:       pow2()        "mathx.h"
             receive()

CALLED BY:

PARAMETERS:

int   node        the number of the node calling this function
int   dim         the dimension of the hypercube
int   direction   direction to receive from
char  *buf        a pointer to the head of the message
long  len         the number of bytes to be passed

*/

#ifdef PROTOTYPE
void directional_receive(int node, int dim, int direction,
                         char *buf, long len);
#else
void directional_receive();
#endif


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    To send in a prescribed direction. The direction is as
described in directional_exchange() above.

PREREQUISITE:   initialize_hypercube()

INCLUDE:     <conc.h>      (Logical Systems C, version 89.1)
             "comm.h"

CALLS:       pow2()        "mathx.h"
             send()

CALLED BY:

PARAMETERS:

int   node        the number of the node calling this function
int   dim         the dimension of the hypercube
int   direction   direction to send to
char  *buf        a pointer to the head of the message
long  len         the number of bytes to be passed

*/

#ifdef PROTOTYPE
void directional_send(int node, int dim, int direction,
                      char *buf, long len);
#else
void directional_send();
#endif


/* =========    FUNCTION DECLARATION    =========

PURPOSE:    To give the Hamming distance between i and j.

INCLUDE:     "comm.h"

CALLS:      sizeof()

CALLED BY:

PARAMETERS:  int   i, j   the numbers

RETURNS:     (int) the Hamming distance(i, j). That is, the number of
ones in the binary exclusive OR (i XOR j).

*/

#ifdef PROTOTYPE
int hamming_distance(int i, int j);
#else
int hamming_distance(/* int i, int j */);
#endif
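
As an illustration of the RETURNS clause, the count of ones in (i XOR j) can be computed as follows. This is a minimal sketch, not the thesis implementation: the CALLS entry suggests the original loops over sizeof(int) bits, whereas this version clears one set bit per iteration. The name hamming_distance_sketch() is hypothetical.

```c
#include <assert.h>

/* Number of bit positions in which node numbers i and j differ. */
int hamming_distance_sketch(int i, int j)
{
    unsigned int diff;
    int count = 0;

    assert(i >= 0 && j >= 0);       /* node numbers are non-negative */
    diff = (unsigned int)i ^ (unsigned int)j;
    while (diff != 0) {
        diff &= diff - 1;           /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

Two nodes are neighbors in the hypercube exactly when this distance is 1, as with nodes 12 and 4.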


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    The initialize_hypercube() function creates the hypercube
and performs the required setup for communications. It
must be completed before you expect to communicate. On the iPSC/2,
ONLY the host code should call this function. For transputer
implementations every node should call it (in addition to the root node). This
is prerequisite to most of the other functions in this file. The basic
requirements for this function are so different (machine dependent)
that there are two versions: one for the transputers and one for the
iPSC/2 machine.


INCLUDE:     "comm.h"

CALLS:       attachcube()      (Intel iPSC/2 C Library)
             calloc()
             free()
             getcube()         (Intel iPSC/2 C Library)
             linkin()
             linkout()
             load()            (Intel iPSC/2 C Library)
             malloc()
             printf()
             setpid()          (Intel iPSC/2 C Library)
             sizeof()
             strcpy()

CALLED BY:

PARAMETERS:  In both cases, the desired dimension of the hypercube is
passed in as the first argument. After this, the functions
are quite different.

(1) iPSC/2

char *nodecode   A pointer to the filename of the nodecode is
                 required so that the function can load the node
                 program.


(2) transputers

Channel *ic[(CUBESIZE + 1)] This is the incoming channel list.
You must declare it globally. Let CUBESIZE be the number of
transputers in the hypercube. Then ic[] is a vector of length
(CUBESIZE + 1). The indexing is such that (ic[n] == C), where
n is some neighbor and C is the incoming Channel* from n. For
instance, if node k finds that ic[n] == LINK1IN then node k
knows to receive messages from node n via LINK1IN. The element
ic[CUBESIZE] holds the channel for the root node (if any).
ic[n] == NULL means that there is no connection to node n.

Channel *oc[(CUBESIZE + 1)] is the outgoing channel list. It
is completely analogous to ic[] except that it will hold
LINK0OUT, LINK1OUT, LINK2OUT, or LINK3OUT for the appropriate
node index. Your only obligation is to define these lists as
globals in the manner shown. The Channel pointer elements will
be filled in by initialize_hypercube().

RETURNS:  The iPSC/2 version of the function returns a pointer to the
name of the cube. In the transputer environment, the cubename
has no meaning, so a void function suffices. For the
transputer environment, the single most important task that
initialize_hypercube() performs is the filling of ic[] and
oc[]. These vectors are used by most of the other
communications functions.

*/

#ifdef TRANSPUTER
void initialize_hypercube(int dim);
#else
char *initialize_hypercube(/* int dim, char *nodecode */);
#endif


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    This function, called from any node in the hypercube,
returns the dimension of the smallest hypercube containing
that node.

INCLUDE:     "comm.h"

CALLS:      pow2()  "mathx.h"

CALLED BY:

PARAMETERS:  int node       the inquiring node

RETURNS:     For an n-cube containing P == 2^n processors, this function
is designed to work for nodes numbered 0 through (P-1). If
the function is called from the root (host) node, there is no guarantee
as to the returned value. If it is called by a valid node, it will
return the dimension of the smallest hypercube containing that node
number. For instance least_dimension(0) == 0, least_dimension(1) == 1,
least_dimension(2) == 2, least_dimension(3) == 2, and
least_dimension(8) == 4.


*/

#ifdef PROTOTYPE
int least_dimension(int node);
#else
int least_dimension(/* int node */);
#endif
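
The values listed under RETURNS are those of a "position of the highest set bit" function. The sketch below reproduces them with shifts; the thesis version apparently compares against pow2() from "mathx.h" instead, so this is an equivalent formulation under that reading, and the name least_dimension_sketch() is hypothetical.

```c
#include <assert.h>

/* Dimension of the smallest hypercube whose node numbers include `node`:
 * 0 for node 0, otherwise the position of the highest set bit counted
 * from 1 (so nodes 4..7 give 3 and nodes 8..15 give 4). */
int least_dimension_sketch(int node)
{
    int d = 0;

    assert(node >= 0);      /* the root/host has no valid node number */
    while (node > 0) {
        node >>= 1;
        d++;
    }
    return d;
}
```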

/* =========  FUNCTION DECLARATIONS  =========


PURPOSE:    The receive() and send() functions declared below provide
communication to (from) a buffer pointed to by buf. The
volume of material to send (receive) is indicated in bytes by the len
argument. The destination (origin) is given by the first argument,
using a valid node number. Suppose you have an n-cube established upon
a system with p == (2^n) node processors. Then you should refer to the
nodes of the hypercube by their node number, which is a Gray coded
value in the range [ 0, (p-1) ]. If you are at the root, of course,
you may not communicate with the root (at least not with these
functions); but if you are at one of the nodes of the hypercube, you may
communicate with the root by using myhost() as the origin (or
destination) of your message. The macro given above makes myhost() available
on the transputers.

Transputers or iPSC/2? The type parameter is only used in the implied
sense with the iPSC/2 implementation [ it becomes type or typesel for
csend() or crecv() ]. For transputer implementations, type MUST BE set
equal to the number of nodes in the hypercube (e.g., p in the example
above). I have called this 'cubesize' in most of my references.


PREREQUISITE:   initialize_hypercube()

INCLUDE:     <conc.h>      (Logical Systems C, version 89.1)
             "comm.h"

CALLS:       ChanIn()      (Logical Systems C, version 89.1)
             ChanOut()     (Logical Systems C, version 89.1)
             crecv()       (Intel iPSC/2 C Library)
             csend()       (Intel iPSC/2 C Library)

CALLED BY:

================   CAUTION   ================

Make sure type == cubesize in the transputer case (see the note above)!

*/

#ifdef PROTOTYPE
void receive(int origin, char *buf, long len, long type);
void send(int destination, char *buf, long len, long type);
#else
void receive(/* int origin, char *buf, long len, long type */);
void send(/* int destination, char *buf, long len, long type */);
#endif


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    This function is called from the nodes to submit a message
to the next lower dimension. If it is called from the host
(root) it has no effect. When it is called from node zero, the
transmission is directed to the root/host. When called from any other node,
the information in buf is passed to the proper node in the next lower
dimension. The lower dimension must have an accepting coalesce() or
other receiving function [ coalesce() and submit() are meant to be used
in a balanced fashion, where each submit() or group of submit()'s in
one dimension is matched by a coalesce() in the next lower dimension ].


PREREQUISITE:   initialize_hypercube()

INCLUDE:     <conc.h>            (Logical Systems C, version 89.1)
             "comm.h"

CALLS:       least_dimension()
             pow2()              "mathx.h"
             send()

CALLED BY:


EXCEPTIONS:  Again, we have the hybrid hypercube in the transputer case
(see many comments above). The general rule is changed in
this case since node 8 submit()s to the root and not node 0. This is
the only change.

SPECIFICS:   If you need to determine exactly where a submit() will go,
you can figure it out in the following manner [ with the
obvious EXCEPTIONS (the previous paragraph) ] ....

Suppose you are 'at' node i in an n-cube (p processors = 2^n). You
must submit() information to the (unique) node, j, that satisfies two
requirements:

(1) hamming_distance(i, j) == 1

(2) least_dimension(i) == (least_dimension(j) + 1)

So, for instance, consider a 4-cube where i == 12. It should be fairly
easy to see that j will be node 4. This is because these two nodes are
adjacent and they are one dimension apart in the cube (i.e., node 4
first appears in a 3-cube and node 12 first appears in a 4-cube).

PARAMETERS:

int   node   the sending node
int   dim    the dimension of the hypercube
char  *buf   a pointer to the head of the message
long  len    the number of bytes to be passed
long  type   the type of the message (iPSC/2 applications only), or
             cubesize in the transputer case


*/

#ifdef PROTOTYPE
void submit(int node, int dim, char *buf, long len, long type);
#else
void submit(/* int node, int dim, char *buf, long len, long type */);
#endif
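
To see where a submit() lands on an ordinary hypercube, the sketch below assumes that submit() simply reverses the cubecast() tree, i.e. the destination is the sender's node number with its highest set bit cleared. That assumption agrees with the i == 12, j == 4 case in SPECIFICS and with the coalesce() example in which node 0 receives from nodes 1, 2, and 4; it is not taken from the thesis code, and the name submit_destination() is hypothetical.

```c
#include <assert.h>

/* Node that `node` submits to on an ordinary hypercube, or -1 for
 * node 0, whose submit() is directed to the root/host. */
int submit_destination(int node)
{
    int top = 1;

    assert(node >= 0);
    if (node == 0)
        return -1;
    while ((top << 1) <= node)      /* find the highest set bit */
        top <<= 1;
    return node ^ top;              /* clear it */
}
```

The destination always differs from the sender in exactly one bit and always has a smaller least_dimension() than the sender.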

/* ==============  EOF comm.h  ============== */
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DESCRIPTION 


This  file  contains  the  definition  of  Complex_Type  and  declarations  of 
functions  that  perform  operations  with  complex  numbers: 

cadd() 

cdiv() 

cmul() 

csub() 

Im() 

Re() 


/* =========  TYPE DEFINITION  ========= */

typedef struct {
    double x,    /* real part      */
           y;    /* imaginary part */
} Complex_Type;


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    To add two complex numbers, z1 and z2, and place their sum
in the Complex_Type '*sum'.

INCLUDE:     "complex.h"

PARAMETERS:  The parameters give the two operands z1 and z2, and a
pointer to the result, sum.

EXAMPLE:     Complex_Type z1, z2, z3;

cadd(z1, z2, &z3);

*/

#ifdef PROTOTYPE
void cadd(Complex_Type z1, Complex_Type z2, Complex_Type *sum);
#else
void cadd();
#endif


/* =========  FUNCTION DECLARATION  =========

PURPOSE:     To divide two complex numbers, (z1 / z2), and place the
result in the Complex_Type '*quotient'.

ALGORITHM:   The code uses Smith's formula (page 25 of [1]) to perform
the division.

INCLUDE:     "complex.h"

PARAMETERS:  The parameters give the two operands z1 and z2, and a
pointer to the result, quotient.

EXAMPLE:     Complex_Type z1, z2, z3;

cdiv(z1, z2, &z3);

*/

#ifdef PROTOTYPE
void cdiv(Complex_Type z1, Complex_Type z2, Complex_Type *quotient);
#else
void cdiv();
#endif
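
Smith's formula divides by the larger component of the denominator first, so the intermediate quantities stay close to the size of the operands instead of involving the squares of z2's parts. The following self-contained sketch mirrors the two branches used by cdiv(); the type Cplx and the name smith_divide() are stand-ins chosen for illustration, and a zero divisor is not handled.

```c
#include <assert.h>
#include <math.h>

typedef struct { double x, y; } Cplx;   /* x real part, y imaginary part */

/* quotient = z1 / z2 by Smith's formula. */
void smith_divide(Cplx z1, Cplx z2, Cplx *quotient)
{
    double d, denom;

    assert(quotient != 0);
    if (fabs(z2.y) < fabs(z2.x)) {
        d = z2.y / z2.x;                    /* |d| < 1 */
        denom = z2.x + z2.y * d;
        quotient->x = (z1.x + z1.y * d) / denom;
        quotient->y = (z1.y - z1.x * d) / denom;
    } else {
        d = z2.x / z2.y;                    /* |d| <= 1 */
        denom = z2.y + z2.x * d;
        quotient->x = (z1.x * d + z1.y) / denom;
        quotient->y = (z1.y * d - z1.x) / denom;
    }
}
```

As a check by hand, (1 + 2i) / (3 + 4i) = (11 + 2i) / 25, so the quotient should come out as 0.44 + 0.08i up to rounding.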


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    To multiply two complex numbers, z1 and z2, and place their
product in the Complex_Type '*product'.

INCLUDE:     "complex.h"

PARAMETERS:  The parameters give the two operands z1 and z2, and a
pointer to the result, product.

EXAMPLE:     Complex_Type z1, z2, z3;

cmul(z1, z2, &z3);

*/

#ifdef PROTOTYPE
void cmul(Complex_Type z1, Complex_Type z2, Complex_Type *product);
#else
void cmul();
#endif
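
A short worked case may help when checking the result-through-pointer convention shared by cadd(), csub(), cmul(), and cdiv(). The sketch below uses stand-in names (Cplx, cmul_sketch) so that it compiles on its own; it follows the usual rule (a + bi)(c + di) = (ac - bd) + (ad + bc)i.

```c
#include <assert.h>

typedef struct { double x, y; } Cplx;   /* x real part, y imaginary part */

/* product = z1 * z2; the caller passes the address of the result. */
void cmul_sketch(Cplx z1, Cplx z2, Cplx *product)
{
    assert(product != 0);
    product->x = z1.x * z2.x - z1.y * z2.y;
    product->y = z1.x * z2.y + z1.y * z2.x;
}
```

For example, (1 + 2i)(3 + 4i) = -5 + 10i. Because the operands are passed by value, the result pointer may safely refer to one of the operand variables.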


/* =========  FUNCTION DECLARATION  =========

PURPOSE:    To place the difference of two complex numbers, (z1 - z2),
into the Complex_Type '*difference'.

INCLUDE:     "complex.h"

PARAMETERS:  The parameters give the two operands z1 and z2, and a
pointer to the result, difference.

EXAMPLE:     Complex_Type z1, z2, z3;

csub(z1, z2, &z3);

*/

#ifdef PROTOTYPE
void csub(Complex_Type z1, Complex_Type z2, Complex_Type *difference);
#else
void csub();
#endif


/* ==========    FUNCTION DECLARATION    ==========

PURPOSE:     To return the imaginary part of a complex number, z.

PARAMETERS:  The complex number, z, is passed into Im().

RETURNS:    The imaginary part of z as type double; that is, a real
number y so that y * sqrt(-1) [or iy] is the imaginary part
of z.

EXAMPLE:    y = Im(z);

*/

#ifdef PROTOTYPE
double Im(Complex_Type z);
#else
double Im();
#endif


/* ==========        FUNCTION DECLARATION        ==========
 *
 * PURPOSE:    This function returns the real part of a complex number, z.
 *
 * PARAMETERS: The complex number, z, is passed into Re().
 *
 * RETURNS:    The real part of z as type double.
 *
 * EXAMPLE:    x = Re(z);
 *
 * ========================================================
 */

#ifdef PROTOTYPE
double Re(Complex_Type z);
#else
double Re();
#endif

/* ============        EOF complex.h       ============ */


/* =========  PROGRAM INFORMATION  =========
 *
 * SOURCE
 * VERSION    1.6
 * DATE
 * AUTHOR
 * DETAILS    See "complex.h".
 */

#include <stdio.h>
#include <math.h>       /* fabs(), used by cdiv() */
#include "complex.h"


/* =========  FUNCTION DEFINITION  ========= */

#ifdef PROTOTYPE
void cadd(Complex_Type z1, Complex_Type z2, Complex_Type *sum)
#else
void cadd(z1, z2, sum)
Complex_Type z1,
             z2,
             *sum;
#endif
{
    sum->x = z1.x + z2.x;    /* real parts add */
    sum->y = z1.y + z2.y;    /* imaginary parts add */
}
/* End cadd() */


/* 


FUNCTION  DEFINITION 


*/ 


#ifdef PROTOTYPE

void cdiv(Complex_Type z1, Complex_Type z2, Complex_Type *quotient)

#else

void cdiv(z1, z2, quotient)

Complex_Type z1,
             z2,
             *quotient;

#endif
{

    double d;


    if (fabs(z2.y) < fabs(z2.x)) {

        d = (z2.y / z2.x);

        quotient->x = ((z1.x + z1.y * d)/(z2.x + z2.y * d));
        quotient->y = ((z1.y - z1.x * d)/(z2.x + z2.y * d));
    }
    else {

        d = (z2.x / z2.y);

        quotient->x = (( z1.y + z1.x * d)/(z2.y + z2.x * d));
        quotient->y = ((-z1.x + z1.y * d)/(z2.y + z2.x * d));
    }
}
/* End cdiv() */
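The branch in cdiv() on the relative sizes of z2.x and z2.y is the scaled form of complex division usually attributed to Smith: dividing through by the larger component of the denominator avoids forming z2.x*z2.x + z2.y*z2.y, which can overflow or underflow. The sketch below restates the same arithmetic in self-contained form. The names Cplx and cplx_div are hypothetical stand-ins (Complex_Type itself is defined in "complex.h", which is not reproduced here), and x is assumed to hold the real part, as in cadd() and cmul().

```c
#include <math.h>

/* Hypothetical stand-in for Complex_Type: x = real part, y = imaginary part. */
typedef struct { double x, y; } Cplx;

/* Scaled complex division: *q = z1 / z2, without squaring z2's components. */
void cplx_div(Cplx z1, Cplx z2, Cplx *q)
{
    double d, denom;

    if (fabs(z2.y) < fabs(z2.x)) {
        d = z2.y / z2.x;                 /* |d| < 1 */
        denom = z2.x + z2.y * d;
        q->x = (z1.x + z1.y * d) / denom;
        q->y = (z1.y - z1.x * d) / denom;
    } else {
        d = z2.x / z2.y;                 /* |d| <= 1 */
        denom = z2.y + z2.x * d;
        q->x = ( z1.y + z1.x * d) / denom;
        q->y = (-z1.x + z1.y * d) / denom;
    }
}
```

For example, (1 + 2i)/(3 + 4i) takes the second branch and (1 + 2i)/(4 + 3i) takes the first; both agree with multiplying by the conjugate and dividing by 25.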



/* =========         FUNCTION     DEFINITION        ========= */


#ifdef PROTOTYPE

void cmul(Complex_Type z1, Complex_Type z2, Complex_Type *product)

#else

void cmul(z1, z2, product)

Complex_Type z1,
             z2,
             *product;

#endif
{

    product->x = (z1.x * z2.x - z1.y * z2.y);
    product->y = (z1.x * z2.y + z1.y * z2.x);

}
/* End cmul() */



/* =========         FUNCTION     DEFINITION        ========= */


#ifdef PROTOTYPE

void csub(Complex_Type z1, Complex_Type z2, Complex_Type *difference)

#else

void csub(z1, z2, difference)

Complex_Type z1,
             z2,
             *difference;

#endif
{

    difference->x = z1.x - z2.x;
    difference->y = z1.y - z2.y;

}
/* End csub() */



/* =========         FUNCTION     DEFINITION        ========= */


#ifdef PROTOTYPE

double Im(Complex_Type z)

#else

double Im(z)

Complex_Type z;

#endif
{

    return(z.y);    /* y holds the imaginary part, as in cadd() and cmul() */

}
/* End Im() */



/* =========  FUNCTION     DEFINITION        ========= */


#ifdef PROTOTYPE

double Re(Complex_Type z)

#else

double Re(z)

Complex_Type z;

#endif
{

    return(z.x);    /* x holds the real part, as in cadd() and cmul() */

}
/* End Re() */


/* ============        EOF   complex.c        ============ */




epsilon.h 


/*
 *  REFERENCES
 *
 *  [1]  Gragg, William B.  Personal conversations, course notes, and MATLAB
 *       code, 1991.
 *
 *  DESCRIPTION
 *
 *  This file contains declarations of functions that determine the machine
 *  precision for a particular machine.  The definition of epsilon is given
 *  below.
 *
 *  LIST OF FUNCTIONS
 *
 *  epsd()
 *  epsf()
 */

/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To find the machine precision.  The machine precision, eps,
 *              is defined as the largest number which satisfies:
 *
 *                  1.0 + eps == 1.0
 *
 *              This program uses the type "double" which normally means an
 *              8-byte (64-bit) floating-point number stored in the IEEE 754
 *              double precision standard representation of
 *              [ 1 sign bit ][ 11-bit exponent ][ 52-bit mantissa/significand ].
 *
 *  INCLUDE:    "epsilon.h"
 *
 *  RETURNS:    The value of epsilon (double).
 */

double epsd();
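The definition above suggests a simple halving search. The following is only a sketch of that idea and not the thesis implementation of epsd() and epsf(), which is not reproduced in this listing; the names eps_double and eps_float are hypothetical. The volatile temporaries are an assumption made here to force each sum to be rounded to its declared type, which addresses the concern (raised in the epsf() notes) that C may carry out the arithmetic in a wider type.

```c
#include <float.h>

/* Halve eps until 1.0 + eps is indistinguishable from 1.0 in type double. */
double eps_double(void)
{
    volatile double sum;   /* volatile: force rounding to double on each store */
    double eps = 1.0;

    do {
        eps /= 2.0;
        sum = 1.0 + eps;
    } while (sum != 1.0);

    return eps;
}

/* The same search carried out in type float. */
float eps_float(void)
{
    volatile float sum;    /* volatile: force rounding to float on each store */
    float eps = 1.0f;

    do {
        eps /= 2.0f;
        sum = 1.0f + eps;
    } while (sum != 1.0f);

    return eps;
}
```

On a machine with IEEE 754 arithmetic and the default round-to-nearest mode, this search should stop at half of the DBL_EPSILON and FLT_EPSILON values from <float.h>, since those constants are defined as the gap between 1.0 and the next representable number rather than by the equality used here.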



/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    This function is identical to epsd() except that it returns
 *              type float.  Note:  The values returned may be identical,
 *              probably reflecting C arithmetic done in type double
 *              regardless of the ultimate type returned.  Anyway, this
 *              function does everything using type float.
 *
 *  INCLUDE:    "epsilon.h"
 *
 *  RETURNS:    The value of epsilon (float).
 */

float epsf();


/* ============        EOF  epsilon.h        ============ */
==========    PROGRAM  INFORMATION    ==========

generate.h

REFERENCES


[1]  Gragg,  William  B.   Personal  conversations,  course  notes,  and  MATLAB 
codes,  1991. 


==============   DESCRIPTION   =============== 

Declarations  of  matrix  and  vector  generation/initialization  functions 


LIST  OF  FUNCTIONS 


hilbert()
identity()
initial_permutation_vector()
mxrand()
wilkinson()
zeros()


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    This function generates a Hilbert matrix of the specified
 *              size.  The function takes care of memory allocation, so
 *              the caller does not need to do this.  The definition used
 *              for a Hilbert matrix is (for rows and columns numbered from
 *              1) that the element at the (i,j) position has the value
 *              (1/(i + j - 1)).
 *
 *  INCLUDE:    "allocate.h"
 *              "matrix.h"
 *
 *  CALLS:      matalloc()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The parameters tell the size of the desired matrix.
 *
 *  RETURNS:    On success (i.e. no allocation problems), hilbert() returns
 *              the allocated matrix filled with the values as described.
 *              A NULL return value flags an allocation failure.
 *
 *  EXAMPLE:    Double_Matrix_Type *A = hilbert(5, 7);
 */
#ifdef PROTOTYPE

Double_Matrix_Type *hilbert(int rows, int cols);
#else

Double_Matrix_Type *hilbert();
#endif
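Because C indexes from 0, the 1-based definition 1/(i + j - 1) becomes 1/(i + j + 1) in code. The sketch below illustrates this on a flat, row-major array. It is an assumption-laden stand-in: hilbert_flat is a hypothetical name, and it uses malloc() directly instead of matalloc() and Double_Matrix_Type, whose definitions are not shown in this listing.

```c
#include <stdlib.h>

/* Allocate a rows-by-cols Hilbert matrix in row-major order.
 * Returns NULL on a bad size or an allocation failure. */
double *hilbert_flat(int rows, int cols)
{
    double *a;
    int i, j;

    if (rows <= 0 || cols <= 0)
        return NULL;

    a = malloc((size_t)rows * (size_t)cols * sizeof *a);
    if (a == NULL)
        return NULL;

    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            a[i * cols + j] = 1.0 / (double)(i + j + 1);   /* 0-based indices */

    return a;
}
```

The caller owns the returned block and releases it with free().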


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    This function generates an identity matrix of the specified
 *              size.  The function takes care of memory allocation, so
 *              the caller does not need to do this.
 *
 *  INCLUDE:    "allocate.h"
 *              "matrix.h"
 *
 *  CALLS:      matalloc()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The parameters tell the size of the matrix.
 *
 *  RETURNS:    On success (i.e., no allocation problems), identity()
 *              returns the allocated matrix filled with ones on the
 *              diagonal.  A NULL return value flags an allocation failure.
 *
 *  EXAMPLE:    Double_Matrix_Type *A = identity(5, 7);
 */

#ifdef PROTOTYPE

Double_Matrix_Type *identity(int rows, int cols);
#else

Double_Matrix_Type *identity();
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To initialize a permutation vector, p[].  This function
 *              performs allocation for p[], assuming that it must contain
 *              n integer elements.  Additionally, the function assigns
 *              values p[j] = j for all 0 <= j < n.  If allocation fails, p
 *              will be NULL upon return.
 *
 *  INCLUDE:    "allocate.h"
 *
 *  CALLS:      intvecalloc()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The size of the vector, n.
 *
 *  RETURNS:    (A pointer to) The vector.
 */
#ifdef PROTOTYPE

int *initial_permutation_vector(int n);
#else

int *initial_permutation_vector();
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    This function generates a matrix whose elements are pseudo-
 *              random numbers (generated by lcdrand() in mathx.c).
 *
 *  INCLUDE:    "allocate.h"
 *              "mathx.h"
 *              "matrix.h"
 *
 *  CALLS:      lcdrand()
 *              matalloc()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The parameters tell the size of the matrix.
 *
 *  RETURNS:    On success (i.e., no allocation problems), mxrand() returns
 *              the allocated matrix filled with the random values.  A NULL
 *              return value flags an allocation failure.
 *
 *  EXAMPLE:    Double_Matrix_Type *A = mxrand(5, 7);
 */

#ifdef PROTOTYPE

Double_Matrix_Type *mxrand(int rows, int cols);
#else

Double_Matrix_Type *mxrand();
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    This function generates a Wilkinson matrix of the specified
 *              size.  The function takes care of memory allocation, so
 *              the caller does not need to do this.  The definition used
 *              for a Wilkinson matrix is: ones along the diagonal, ones
 *              along the rightmost column, zeros in the upper right
 *              triangle, and (-1)'s in the lower left triangle.
 *
 *                  [   1               1  ]
 *                  [  -1    1          1  ]
 *                  [  -1   -1    1     1  ]
 *                  [  -1   -1   -1     1  ]
 *
 *  INCLUDE:    "allocate.h"
 *              "matrix.h"
 *
 *  CALLS:      matalloc()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The parameters tell the size of the matrix.
 *
 *  RETURNS:    On success (i.e. no allocation problems), wilkinson()
 *              returns the allocated matrix filled with the values as
 *              described.  On (allocation) failure, wilkinson() returns
 *              NULL.
 *
 *  EXAMPLE:    Double_Matrix_Type *A = wilkinson(5, 7);
 */
#ifdef PROTOTYPE

Double_Matrix_Type *wilkinson(int rows, int cols);
#else

Double_Matrix_Type *wilkinson();
#endif
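The four-part definition of the Wilkinson matrix reduces to one test per element. The sketch below fills a caller-supplied flat, row-major array; wilkinson_fill is a hypothetical name, and the flat layout is an assumption made here because Double_Matrix_Type and matalloc() are declared in headers not shown in this listing.

```c
/* Fill a rows-by-cols row-major array with the Wilkinson pattern:
 * ones on the diagonal and in the last column, -1 below the diagonal,
 * zero elsewhere. */
void wilkinson_fill(double *a, int rows, int cols)
{
    int i, j;

    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            if (j == cols - 1 || i == j)
                a[i * cols + j] = 1.0;    /* last column or diagonal */
            else if (i > j)
                a[i * cols + j] = -1.0;   /* lower left triangle */
            else
                a[i * cols + j] = 0.0;    /* upper right triangle */
        }
    }
}
```

Testing the last column first matters for non-square sizes such as the 5-by-7 example, where the diagonal never reaches the rightmost column.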


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    This function generates a matrix of the specified size,
 *              where all of the entries are zero.
 *
 *  INCLUDE:    "allocate.h"
 *              "matrix.h"
 *
 *  CALLS:      matalloc()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The parameters tell the size of the matrix.
 *
 *  RETURNS:    On success (i.e. no allocation problems), zeros() returns
 *              the allocated matrix filled with zeros.  On allocation
 *              failure, zeros() returns NULL.
 *
 *  EXAMPLE:    Double_Matrix_Type *A = zeros(5, 7);
 */

#ifdef PROTOTYPE

Double_Matrix_Type *zeros(int rows, int cols);

#else

Double_Matrix_Type *zeros();

#endif


/* ===========       EOF     generate.h       =========== */
SOURCE      io.h
VERSION     2.2
DATE        09 September 1991
AUTHOR      Jonathan E. Hartman, U. S. Naval Postgraduate School


DESCRIPTION 


This file contains declarations of functions for matrix and vector
input/output.  The matrix structures such as "Double_Matrix_Type" are
given in "matrix.h".

The  following  parameters  are  common  enough  to  justify  a  one-time 
explanation  here  (and  not  with  each  occurrence  below) : 

width   the  width  in  which  to  print  a  value 

aft     the  number  of  places  to  print  after  the  decimal  point 


LIST  OF  FUNCTIONS


answer()
fill_matrix()
fread_matrix()
fwrite_matrix()
getint()
get_matrix_size()
pause()
printmd()
printvd()
printvi()


/* ==========        MANIFEST  CONSTANTS        ========== */

#define LONG_AFT      8
#define LONG_WIDTH    12
#define SHORT_AFT     2
#define SHORT_WIDTH   5
#define STD_AFT       3
#define STD_WIDTH     8
/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To get a yes or no answer from the user.
 *
 *  NOTE:       This function includes the prompt "(y/n)?  " so you do not
 *              have to include this in your query.  There is no space
 *              before, two spaces after, and no newline (i.e. as shown).
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  CALLS:      getchar()       <stdio.h>
 *
 *  CALLED BY:
 *
 *  PARAMETERS: void.
 *
 *  RETURNS:    (int) YES or NO (as defined in matrix.h).
 */

int answer();


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    A function which prompts the user for the pertinent data
 *              about a matrix and fills the structure provided with the
 *              appropriate information.  That is, this function allows the
 *              user to input the values of the elements.
 *
 *  PARAMETERS: A pointer to the structure containing the matrix to be
 *              filled.
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  CAUTION:    This function ASSUMES that the "rows" and "cols" fields
 *              have been correctly assigned by something like matalloc()
 *              [see "allocate.h"] and makes no effort to enter a value in
 *              those fields of the matrix structure.
 *
 *  CALLS:      ()
 *
 *  CALLED BY:
 *
 *  RETURNS:    The matrix associated with A is operated on during the
 *              execution of the function, and the result is available
 *              upon return.  The function itself returns void.
 *
 *  EXAMPLE:    fill_matrix(A);
 */
#ifdef PROTOTYPE

void fill_matrix(Double_Matrix_Type *A);
#else

void fill_matrix();
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    A function which reads data from a file and stores it in
 *              the matrix of A.  This function takes care of matrix
 *              allocation for the caller.
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  CAUTION:    This function ASSUMES the file has been stored in the
 *              format described in "matrix.fmt".
 *
 *  CALLS:      fgets()
 *              fscanf()
 *              rewind()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The pointer to the matrix structure and the file pointer.
 *
 *  RETURNS:    1 on success and 0 on any sort of failure.
 */
#ifdef PROTOTYPE

int fread_matrix(Double_Matrix_Type **A, FILE *fp);
#else

int fread_matrix();
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    A function which writes data from A->matrix[][] to a file
 *              pointed to by fp.
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  ASSUMPTION: The caller has already performed fopen() on fp for the
 *              "w" (write) mode.
 *
 *  CALLS:      fprintf()
 *              rewind()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: A is a pointer to the structure which contains the matrix,
 *              fp is a FILE pointer.
 *
 *  RETURNS:    1 on success and 0 on failure.
 */

#ifdef PROTOTYPE

int fwrite_matrix(Double_Matrix_Type *A, FILE *fp, int width, int aft);
#else

int fwrite_matrix();

#endif
/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    A function to get user input of a single integer.
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  CALLS:      fflush()
 *              scanf()
 *
 *  CALLED BY:
 *
 *  RETURNS:    The user's integer.
 */

int getint();


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    A function to ask the user for the size of a matrix.
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  CALLS:      answer()
 *              fflush()
 *              scanf()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: Pointers to the size of the matrix (m rows by n columns).
 */
#ifdef PROTOTYPE

void get_matrix_size(int *m, int *n);
#else

void get_matrix_size();
#endif
/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    Press a key to continue!
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  CALLS:      fflush()
 *              getchar()
 *              printf()
 */

void pause();


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    This function provides a printout of the information stored
 *              in the structure A.
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  CALLS:      printf()
 *
 *  PARAMETERS: A is the structure that contains the matrix to be printed.
 *              The width and aft values are described near the top of this
 *              file.  The defaults are defined as manifest constants.
 *
 *  EXAMPLE:    Double_Matrix_Type *A = hilbert(7, 5);
 *
 *              printmd(*A, LONG_WIDTH, LONG_AFT);
 */
#ifdef PROTOTYPE

void printmd(Double_Matrix_Type A, int width, int aft);
#else

void printmd();
#endif
/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    This function prints the vector, v, of doubles.
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  CALLS:      printf()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: v is the vector.  size is the number of elements in v[].
 */

#ifdef PROTOTYPE

void printvd(double *v, int size, int width, int aft);
#else

void printvd();
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    This function provides a printout of the integer vector v.
 *
 *  INCLUDE:    <stdio.h>
 *              "io.h"
 *
 *  CALLS:      printf()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: v is a vector of size integers.
 */
#ifdef PROTOTYPE

void printvi(int *v, int size, int width);
#else

void printvi();
#endif


/* ============        EOF  io.h        ============ */
DESCRIPTION


A small extension to the usual C <math.h>


LIST  OF  FUNCTIONS


lcdrand()
lclrand()
multmod()
pow2()


/* ==========        MANIFEST  CONSTANTS        ========== */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE  -1
#endif

#define START     1234567     /* starting value, Xo.  See [1] */
#define MULT      31415821    /* multiplier, a.  See [1] */
#define INCR      1           /* increment, c.  See [1] */
#define SQRTM     10000       /* sqrt(m) */
#define MODULUS   100000000   /* modulus, m.  See [1] */
/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To calculate a pseudo-random number in the range [0, 1]
 *              using the linear congruential method.  This function is a
 *              very simple application of lclrand().  It merely divides
 *              the value that lclrand() returns by the modulus, and
 *              returns the resulting double value.
 *
 *  INCLUDE:    "mathx.h"
 *
 *  CALLS:      lclrand()
 *
 *  CALLED BY:  mxrand()        "generate.c"
 *
 *  PARAMETERS: The parameters are identical to those for lclrand().
 *
 *  RETURNS:    A pseudo-random double value in the range [0.0, 1.0].
 *
 *  EXAMPLE:    double d;
 *
 *              d = lcdrand(START, MULT, INCR, SQRTM, MODULUS);
 */


#ifdef PROTOTYPE

double lcdrand(long Xn, long a, long c, long sqrtm, long m);
#else  /* iPSC/2 */

double lcdrand(/* long Xn, long a, long c, long sqrtm, long m */);
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To calculate a pseudo-random number of type long in the
 *              range [0, (m-1)], where m is the argument for modulus.  The
 *              algorithm uses the linear congruential method.  This method
 *              is given in great detail in [1].  A shorter, algorithmic
 *              treatment is given in [2].  I have tested the function to
 *              be sure that it produces the ten numbers listed on page 513
 *              of [2].
 *
 *  INCLUDE:    "mathx.h"
 *
 *  CALLS:      multmod()
 *
 *  CALLED BY:  lcdrand()
 *
 *  PARAMETERS: The notation comes from [1] (more-or-less).  Xn is the
 *              starting value, a is the multiplier, c is the increment,
 *              sqrtm is the square root of m, which is the modulus.  A
 *              negative value for any of the arguments is impossible and
 *              will invoke the defaults given among the manifest constants
 *              above.  The starting value, Xn, is the exception.  If you
 *              supply a nonnegative value, your value will be accepted as
 *              the starting value.  Else, the starting value BEGINS at the
 *              default START and is changed each time the function is
 *              called (as long as the starting value argument, Xn, is
 *              negative).  That is, Xn HAS MEMORY as long as your program
 *              is running.  The other parameters are determined from
 *              call-to-call.
 *
 *  RETURNS:    A pseudo-random long in the range [ 0, (m-1) ], where m is
 *              the modulus argument.
 *
 *  EXAMPLE:    This example illustrates the use of the default values:
 *
 *              long l;
 *
 *              l = lclrand(START, MULT, INCR, SQRTM, MODULUS);
 */

#ifdef PROTOTYPE

long lclrand(long Xn, long a, long c, long sqrtm, long m);
#else  /* iPSC/2 */

long lclrand(/* long Xn, long a, long c, long sqrtm, long m */);
#endif
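One step of the linear congruential method is X(n+1) = (a * Xn + c) mod m, with the product formed in halves so that it does not overflow a 32-bit long. The sketch below shows that step under simplifying assumptions: the names lc_next, lc_next_unit, and mult_mod are hypothetical, the constants are fixed to the default values listed among the manifest constants, and the "memory" and default-substitution behavior described for lclrand() is deliberately left out.

```c
#define LC_MULT     31415821L    /* multiplier, a (the default MULT) */
#define LC_INCR     1L           /* increment, c (the default INCR) */
#define LC_SQRTM    10000L       /* sqrt(m) (the default SQRTM) */
#define LC_MODULUS  100000000L   /* modulus, m (the default MODULUS) */

/* (a * b) mod sqrtm^2 for 0 <= a, b < sqrtm^2.  Each factor is split into
 * high and low halves; the high*high term is a multiple of sqrtm^2 and
 * drops out, so no intermediate value exceeds about 2 * sqrtm^2. */
static long mult_mod(long a, long b, long sqrtm)
{
    long a1 = a / sqrtm, a0 = a % sqrtm;
    long b1 = b / sqrtm, b0 = b % sqrtm;

    return (((a0 * b1 + a1 * b0) % sqrtm) * sqrtm + a0 * b0) % (sqrtm * sqrtm);
}

/* One step of the generator: returns the successor of x in [0, m-1]. */
long lc_next(long x)
{
    return (mult_mod(x, LC_MULT, LC_SQRTM) + LC_INCR) % LC_MODULUS;
}

/* The same step scaled by the modulus, in the manner described for lcdrand(). */
double lc_next_unit(long x)
{
    return (double)lc_next(x) / (double)LC_MODULUS;
}
```

Starting from 1234567, the first two values are 35884508 and 80001069. Because the long result never exceeds m - 1, the scaled value is always strictly less than 1.0.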
/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To calculate (a * b) mod m^2, while trying to avoid over-
 *              flow.  This function is adapted from Sedgewick's 'mult'
 *              function on page 513 of [1].
 *
 *  INCLUDE:    "mathx.h"
 *
 *  CALLS:
 *
 *  CALLED BY:  lclrand()
 *
 *  PARAMETERS: long a, b, m.
 *
 *  RETURNS:    long (a * b) mod m^2.
 */

#ifdef PROTOTYPE

long multmod(long a, long b, long m);
#else

long multmod(/* long a, long b, long m */);
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To calculate the value of two raised to the (n) power.  This
 *              function [unlike the macro POW2() given in macros.h] will
 *              handle the case where (n == 0).  This function uses left
 *              shifts to achieve the result, so if you ask for too large a
 *              value, the result is not guaranteed.  The value of n is
 *              ASSUMED to be a POSITIVE integer.
 *
 *  INCLUDE:    "mathx.h"
 *
 *  CALLS:
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The desired power of two, n.
 *
 *  RETURNS:    The function returns the value of 2^(n).
 */


#ifdef PROTOTYPE

long pow2(int n);

#else

long pow2(/* int n */);

#endif
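The left-shift approach described for pow2() can be sketched in one line. The name pow2_shift is hypothetical, and the code below is an illustration of the technique, not the thesis definition, which is not reproduced in this listing.

```c
/* 2^n by left shift.  The result is only meaningful for
 * 0 <= n < (number of value bits in a long); outside that range the
 * shift is undefined in C, which is the "not guaranteed" case. */
long pow2_shift(int n)
{
    return 1L << n;
}
```

Shifting 1L by zero places leaves it unchanged, so the n == 0 case needs no special handling.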


/* ==============   EOF  mathx.h   ============== */
PROGRAM  INFORMATION


SOURCE      num_sys.h
VERSION     1.4
DATE        09 September 1991
AUTHOR      Jonathan E. Hartman, U. S. Naval Postgraduate School


REFERENCES 
DESCRIPTION


The "num_sys" group of functions relate to number systems (e.g. binary,
decimal, hexadecimal).


LIST  OF  FUNCTIONS


binrep()
binvec()
hexrep()
ieeerep()


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To display the binary representation of a number.  Given the
 *              parameters described below, binrep() prints the binary
 *              representation.  For numbers of type double, type float, or
 *              type int, binrep() reverses the order of the bytes from the
 *              machine storage.  This makes them more readily recognizable
 *              as [ SIGN ][ EXPONENT ][ MANTISSA ] for the floating-point
 *              types and orders the bytes in order of decreasing
 *              significance for the integers.
 *
 *  INCLUDE:    "num_sys.h"
 *
 *  CALLS:
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h.  The
 *              function understands TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT,
 *              and TYPE_INT.  It also needs a pointer to the_number.
 *
 *  EXAMPLE:    float f;
 *
 *              binrep(TYPE_FLOAT, &f);
 */
#ifdef PROTOTYPE

void binrep(int number_type, void *the_number);
#else

void binrep();
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To expand the bits of the input into an array of integers.
 *              The array only holds zeros and ones, with each element
 *              representing a bit of the input number.
 *
 *  INCLUDE:    "num_sys.h"
 *
 *  CALLS:
 *
 *  CALLED BY:
 *
 *  CAUTION:    This function returns the bits AS THEY ARE IN THE MACHINE!
 *              Many machines store type double, type float, and type int
 *              so that their bytes are in an order that is the reverse of
 *              what you might expect.  Of course, the bits within a byte
 *              are in the expected (msb ... lsb) order.
 *
 *  PARAMETERS: The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h.  The
 *              function recognizes TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT, and
 *              TYPE_INT.  It also asks for a pointer to the number.
 *
 *  RETURNS:    A pointer to int.  The function will take care of allocation
 *              for this pointer, and it will fill the array with the bits
 *              of the number.  For indexing purposes, you will probably
 *              need to know how big this vector is.  Multiply the
 *              [sizeof(type you are sending in)] by 8 (bits/byte).  That's
 *              how many elements will be in the returned vector of integer
 *              (bits).  This pointer will be NULL if there was an
 *              allocation problem.
 *
 *  EXAMPLE:    float f;    Assume that this takes 4 bytes * 8 bits
 *              int   *v;   To hold the bit vector of f (32 elements)
 *
 *              v = binvec(TYPE_FLOAT, &f);
 */
#ifdef PROTOTYPE

int *binvec(int number_type, void *the_number);
#else

int *binvec();
#endif
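The bit expansion described for binvec() can be illustrated by walking the bytes of an object in storage order and peeling off each bit from the most significant end. This is a sketch only: bits_of is a hypothetical name, and it takes a byte count in place of the TYPE_ codes from matrix.h, which are not shown in this listing.

```c
#include <limits.h>
#include <stdlib.h>

/* Expand the nbytes bytes at p into an array of 0/1 integers.
 * Bytes appear in machine storage order; within each byte the bits run
 * from most significant to least significant.  Returns NULL if the
 * allocation fails. */
int *bits_of(const void *p, size_t nbytes)
{
    const unsigned char *bytes = p;
    int *bits = malloc(nbytes * CHAR_BIT * sizeof *bits);
    size_t i;
    int b;

    if (bits == NULL)
        return NULL;

    for (i = 0; i < nbytes; i++)
        for (b = 0; b < CHAR_BIT; b++)
            bits[i * CHAR_BIT + b] = (bytes[i] >> (CHAR_BIT - 1 - b)) & 1;

    return bits;
}
```

A call such as bits_of(&f, sizeof f) for a float f yields sizeof(float) * CHAR_BIT elements, which the caller releases with free(). Reversing the byte order before printing, as described for binrep(), is what makes the sign, exponent, and mantissa fields line up on machines that store the least significant byte first.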


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To display the hexadecimal representation of a number.
 *
 *  INCLUDE:    "num_sys.h"
 *
 *  CALLS:
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h.  The
 *              function recognizes TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT, and
 *              TYPE_INT.  It also needs a pointer to the number.
 *
 *  EXAMPLE:    float f;
 *
 *              printf("The hexadecimal representation of %f is:  ", f);
 *              hexrep(TYPE_FLOAT, &f);
 */

#ifdef PROTOTYPE

void hexrep(int number_type, void *the_number);
#else

void hexrep();
#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To display binary and IEEE representation of a number.  This
 *              is nearly a tutorial function!  It displays a binary
 *              representation of the number, and then breaks out the sign,
 *              exponent, and mantissa (or significand).  Some terse
 *              translation tips are also provided.
 *
 *  INCLUDE:    "num_sys.h"
 *
 *  CALLS:
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h.  This
 *              function ONLY recognizes the floating-point types (i.e.,
 *              TYPE_DOUBLE and TYPE_FLOAT).  It also needs a pointer to
 *              the number.
 *
 *  EXAMPLE:    float f;
 *
 *              printf("The IEEE 754 representation of %f is:  ", f);
 *              ieeerep(TYPE_FLOAT, &f);
 */
#ifdef PROTOTYPE

void ieeerep(int number_type, void *the_number);
#else

void ieeerep();

#endif


/* ============        EOF  num_sys.h        ============ */
PROGRAM  INFORMATION


SOURCE      ops.h
VERSION     1.7
DATE        09 September 1991
AUTHOR      Jonathan E. Hartman, U. S. Naval Postgraduate School


REFERENCES 


[1]  Golub, Gene H., and Charles F. Van Loan.  Matrix Computations.
The


DESCRIPTION 


The functions declared below perform matrix and vector operations.  For
the sake of brevity, I will often use simple (MATLAB-style) notation in
comments.  For instance, x' means x transpose (i.e. a row).  Do not
confuse the comment shorthand with what is really happening in the
code.  My goal is to get function specifications across clearly and
succinctly without excessive concern for implementation.  Here are a
few notes.

An operation preceded by a "." means "elementwise".  For instance,
x .* y means the elementwise vector multiplication of x by y.  That is,
the result would be some vector z like:

    z' = [ x[1]*y[1],  x[2]*y[2],  ...,  x[n]*y[n] ]

If the operation appears without the preceding ".", it means the vector
operation.


LIST  OF  FUNCTIONS 


cols()
dot_product()
matrix_product()
max_element()
normp()
outer_product()
rows()
swap_cols()
swap_rows()
vec_init()
/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To return the number of columns in the matrix A.
 *
 *  INCLUDE:    "ops.h"
 */

#ifdef PROTOTYPE

int cols(Double_Matrix_Type *A);

#else

int cols(/* Double_Matrix_Type *A */);

#endif



/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    Computes the dot product of the input vectors x and y which
 *              is defined in [1] (page 4).  The dot product of x and y is
 *              x' * y.
 *
 *  PARAMETERS: The vectors x and y should be arrays of type double, each
 *              having "size" elements.
 *
 *  INCLUDE:    "ops.h"
 *
 *  CALLS:      N/A
 *
 *  CALLED BY:  matrix_product()        [see below]
 *
 *  RETURNS:    A double (scalar) value equal to the dot product x' * y.
 *
 *  EXAMPLE:    The following example would conclude with answer == 10.0.
 *
 *              double        answer;
 *              static double x[] = { 1.0, 2.0, 3.0 },
 *                            y[] = { 3.0, 2.0, 1.0 };
 *              int           size = 3;
 *
 *              answer = dot_product(x, y, size);
 */
#ifdef PROTOTYPE

double dot_product(double *x, double *y, int size);
#else

double dot_product(/* double *x, double *y, int size */);
#endif
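The dot product x' * y is a single accumulation loop. The sketch below is a minimal illustration with a hypothetical name, dot, and is not the thesis definition of dot_product(), which is not reproduced in this listing.

```c
/* Sum of x[i] * y[i] over the first size elements of each vector. */
double dot(const double *x, const double *y, int size)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < size; i++)
        sum += x[i] * y[i];

    return sum;
}
```

With the vectors from the example above, { 1.0, 2.0, 3.0 } and { 3.0, 2.0, 1.0 }, the terms are 3 + 4 + 3, giving 10.0.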


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To multiply matrices A and B, placing the product in C.
 *
 *  INCLUDE:    "ops.h"
 *
 *  CALLS:      dot_product()       [see above]
 *
 *  CALLED BY:
 *
 *  PARAMETERS: Pointers to the matrix structures A, B, and C.
 *
 *  RETURNS:    SUCCESS if the matrices were compatible for multiplication
 *              and C contained enough space to contain the entire result.
 *              FAILURE if A and B were incompatible or C was not big
 *              enough to hold the product.  The values for SUCCESS and
 *              FAILURE are given in "matrix.h".
 *
 *  EXAMPLE:    Double_Matrix_Type *A,
 *                                 *B,
 *                                 *C;
 *
 *              if (matrix_product(A, B, C) == FAILURE) {
 *                  printf("matrix_product(A,B,C) failed.\n");
 *                  exit(EXIT_FAILURE);
 *              }
 *              else {
 *                  printf("C contains A * B.\n");
 *              }
 */
#ifdef PROTOTYPE

int matrix_product(Double_Matrix_Type *A,
                   Double_Matrix_Type *B,
                   Double_Matrix_Type *C);
#else

int matrix_product();

#endif


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    To search the elements below and to the right of A(k,k) for
 *              the element that is maximum in absolute value.
 *
 *  INCLUDE:    <math.h>        [link using -lm if necessary]
 *              "ops.h"
 *
 *  CALLS:      fabs()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: A is the matrix (structure), k is the index for a position
 *              on the main diagonal, A(k,k).  The search will be conducted
 *              for the area of the matrix that lies below k and to its
 *              right:
 *
 *                  (k,k) ------>
 *                    |   This is the area that will be searched
 *                    |   for an element of maximum absolute value.
 *                    |   The search does NOT include row k nor
 *                    v   does it include column k.
 *
 *              Parameters must also include s, the address of an integer
 *              that will contain the row number for the maximum element
 *              upon return; and t, an address of an integer to store the
 *              column number for the maximum element.
 *
 *  NOTE:       To search the WHOLE MATRIX, the parameter k should be (-1).
 *              The values of k, s, and t should be interpreted as the C
 *              versions of indexes (i.e. beginning with 0).
 *
 *  RETURNS:    The function returns the maximum (in absolute value)
 *              element found in A (type double).  Additionally, the index
 *              values for this element are placed in the variables pointed
 *              to by s (row) and t (col).
 *
 *  EXAMPLE:    Double_Matrix_Type *A;
 *              double u;
 *              int    k,
 *                     s,
 *                     t;
 *
 *              u = max_element(A, k, &s, &t);
 */
#ifdef PROTOTYPE

double max_element(Double_Matrix_Type *A, int k, int *s, int *t);
#else

double max_element();
#endif
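The search region described above is the block of rows and columns strictly greater than k, which is why k = -1 covers the whole matrix. The sketch below shows that loop on a flat, row-major array. It rests on several assumptions made only for illustration: the name max_abs_below_right is hypothetical, the flat layout stands in for Double_Matrix_Type, and the function returns the element itself (with its sign), which is one reading of "the maximum (in absolute value) element".

```c
#include <math.h>

/* Search rows k+1..rows-1 and columns k+1..cols-1 of a row-major matrix
 * for the entry of largest magnitude.  Stores its 0-based position in
 * *s (row) and *t (column) and returns the entry, sign included.
 * If the region is empty, returns 0.0 with *s = *t = k + 1. */
double max_abs_below_right(const double *a, int rows, int cols,
                           int k, int *s, int *t)
{
    double best = 0.0;
    int i, j;

    *s = k + 1;
    *t = k + 1;

    for (i = k + 1; i < rows; i++) {
        for (j = k + 1; j < cols; j++) {
            if (fabs(a[i * cols + j]) > fabs(best)) {
                best = a[i * cols + j];
                *s = i;
                *t = j;
            }
        }
    }
    return best;
}
```

In a pivoting elimination, s and t identify the row and column to exchange with row k + 1 and column k + 1 before the next elimination step.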


/* ==========        FUNCTION  DECLARATION        ==========
 *
 *  PURPOSE:    Computes the p-norm of the input vector x defined in [1]
 *              (page 53).
 *
 *  INCLUDE:    <math.h>
 *              "ops.h"
 *
 *  CALLS:      fabs()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: x is the vector.  It must contain "size" elements of type
 *              double.  The p argument is the p of p-norm.
 *
 *  RETURNS:    A double (scalar) value equal to the p-norm of x.
 *
 *  EXAMPLE:    static double x[] = { 1.0, 2.0, 3.0 };
 *              double Euclidean_norm_of_x;
 *
 *              Euclidean_norm_of_x = normp(x, 2, 3);
 */

#ifdef PROTOTYPE

double normp(double *x, int p, int size);

#else

double normp();

#endif

265 
266 
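
For reference, the usual definition of the vector p-norm is written out below, together with the value expected from the example call normp(x, 2, 3). It is assumed here that the definition cited on page 53 of [1] is this standard one; the symbol n stands for the "size" argument.

```latex
\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^{p} \right)^{1/p},
\qquad
\|(1,\,2,\,3)\|_2 = \sqrt{1 + 4 + 9} = \sqrt{14} \approx 3.7417 .
```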

/* ==========   FUNCTION DECLARATION   ==========
 *
 * PURPOSE:    To place the outer product of x and y in C.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:      N/A
 *
 * CALLED BY:  N/A
 *
 * ASSUMPTION: The matrix associated with C is already allocated to the
 *             proper size.
 *
 * PARAMETERS: Two vectors, x and y, of sizes x_size and y_size; and the
 *             matrix associated with C to accept the outer product.
 *
 * RETURNS:    The matrix associated with C is filled with the proper
 *             values.
 *
 * ===============================================
 */

#ifdef PROTOTYPE

void outer_product(double *x, int x_size, double *y, int y_size,
                   double **C);
#else

void outer_product();

#endif

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/* 

 * ==========   FUNCTION DECLARATION   ==========
 *
 * PURPOSE:    To return the number of rows in the matrix A.
 *
 * INCLUDE:    "ops.h"
 *
 */


#ifdef  PROTOTYPE 

int  rows(Double_Matrix_Type  *A); 
#else 

int  rows() ; 
#endif 


/* ==========   FUNCTION DECLARATION   ==========
 *
 * PURPOSE:    To swap columns p and q in the matrix contained within A.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:      N/A
 *
 * CALLED BY:
 *
 * PARAMETERS: A is the structure holding the matrix.  The integers p and
 *             q are the column numbers to be swapped.  Indexes are
 *             numbered according to the C convention (beginning at zero).
 *
 * RETURNS:    Upon return, the columns have been swapped in A.
 *
 */

#ifdef PROTOTYPE

void swap_cols(Double_Matrix_Type *A, int p, int q);
#else

void swap_cols();
#endif


/* ==========   FUNCTION DECLARATION   ==========
 *
 * PURPOSE:    To swap rows p and q in the matrix contained within A.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:      N/A
 *
 * CALLED BY:
 *
 * PARAMETERS: A is the structure holding the matrix.  The integers p and
 *             q are the row numbers to be swapped.  Indexes are numbered
 *             according to the C convention (beginning at zero).
 *
 * RETURNS:    Upon return, the rows have been swapped in A.
 *
 */
#ifdef  PROTOTYPE 

void  swap_rows(Double_Matrix_Type  *A,  int  p,  int  q) ; 
#else 

void  swap_rows(); 
#endif 


/* ==========   FUNCTION DECLARATION   ==========
 *
 * PURPOSE:    To initialize the vector v of n integers with the values
 *             1, 2, 3, ..., n.
 *
 * INCLUDE:    "ops.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * ASSUMPTION: The vector, v, has already been successfully allocated as
 *             an array of n integers.
 *
 * PARAMETERS: The vector, v, to be initialized; and its size, n.
 *
 * RETURNS:    The vector's elements are set to the new values and these
 *             values are in v[] upon return.
 *
 */


#ifdef  PROTOTYPE 

void  vec_init(int  *v,  int  n) ; 
#else 

void  vec_init() ; 
#endif


/* ==========   EOF ops.h   ========== */


/* ==========   PROGRAM INFORMATION   ==========
 *
 * SOURCE:   timing.h
 * VERSION:  1.2
 * DATE:     09 September 1991
 * AUTHOR:   Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ==========   REFERENCES   ==========
 *
 * [1] Inmos.  The Transputer Databook, Second Edition, 1989.
 * [2] Intel.  iPSC/2 Programmer's Reference Manual.
 *
 *
 * ==========   DESCRIPTION   ==========
 *
 * This file contains definitions of manifest constants, type definitions,
 * and function declarations for time-related tasks on the Intel iPSC/2 or
 * a network of Inmos transputers.
 *
 *
 * ==========   LIST OF FUNCTIONS   ==========
 *
 * clock()
 * delay()
 *
 */

/* ==========   MANIFEST CONSTANTS   ========== */

#ifdef TRANSPUTER

#define LO_PERIOD   64.0e-6     /* period of low priority clock      */
#define HI_PERIOD   1.0e-6      /* period of high priority clock     */
#define LO_FREQ     15625.0     /* frequency of low priority clock   */
#define HI_FREQ     1.0e6       /* frequency of high priority clock  */

#else   /* iPSC/2 */

#define M_PERIOD    1.0e-3      /* period of Intel's mclock()        */
#define M_FREQ      1.0e3       /* frequency for Intel's mclock()    */

#endif


/* ==========   TYPE DEFINITIONS   ==========
 *
 * The type 'ticks' is defined in an effort to make timing a bit more
 * transparent across the machines listed.
 *
 * ==============================================
 */

#ifdef  TRANSPUTER 
typedef  int  ticks; 
#else   /*   iPSC/2   */ 
typedef  unsigned  long  ticks; 
#endif 
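
The constants and the 'ticks' type above are meant to be combined when converting clock samples to seconds. The lines below are a minimal sketch of that conversion, not code from timing.h: the names ticks_sketch, LO_PERIOD_SKETCH, M_PERIOD_SKETCH, and elapsed_seconds are hypothetical, and the unsigned long type simply mirrors the iPSC/2 typedef.

```c
#include <assert.h>

typedef unsigned long ticks_sketch;     /* mirrors the iPSC/2 'ticks' typedef */

#define LO_PERIOD_SKETCH  64.0e-6       /* seconds per low priority tick      */
#define M_PERIOD_SKETCH   1.0e-3        /* seconds per mclock() tick          */

/* Elapsed time in seconds between two clock samples, given the period
 * (in seconds) of the clock that produced them. */
double elapsed_seconds(ticks_sketch start, ticks_sketch stop, double period)
{
    assert(period > 0.0);
    return (double)(stop - start) * period;
}
```

For example, 15625 low priority ticks at 64 microseconds each correspond to one second, which agrees with the frequency constant defined for that clock.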


/* ==========   FUNCTION DECLARATION   ==========
 *
 * PURPOSE:    To get the time (in ticks) from the processor's clock.
 *
 * INCLUDE:    <conc.h>       (Logical Systems C, version 89.1)
 *             "timing.h"
 *
 * CALLS:      Time()         (Logical Systems C, version 89.1)
 *             mclock()       (Intel iPSC/2 C)
 *
 * CALLED BY:
 *
 * PARAMETERS: None.
 *
 * RETURNS:    The function samples the clock and returns ticks.  More
 *             information on ticks, period, and frequency is given in the
 *             definitions above.
 *
 * EXAMPLE:    ticks t[2];
 *
 *             t[0] = clock();
 */

#ifdef PROTOTYPE

ticks clock(void);

#else

ticks clock(/* void */);

#endif


/* ==========   FUNCTION DECLARATION   ==========
 *
 * PURPOSE:    To force a delay of at least a given amount (in seconds) in
 *             program execution.
 *
 * INCLUDE:    <conc.h>           (Logical Systems C, version 89.1)
 *             "timing.h"
 *
 * CALLS:      ProcGetPriority()  (Logical Systems C, version 89.1)
 *             Time()             (Logical Systems C, version 89.1)
 *             mclock()           (Intel iPSC/2 C)
 *
 * CALLED BY:
 *
 * PARAMETERS: The (float) argument tells the function the minimum time
 *             (in seconds) to delay.
 *
 * EXAMPLE:    delay(1.25);
 *
 */
#ifdef  PROTOTYPE 

void  delay(float  seconds); 
#else

void delay( /* float seconds */ );
#endif
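
A delay of "at least" a given time can be sketched as a busy-wait on a clock. The version below is an assumption-laden illustration rather than the timing.c implementation: it uses the standard C library clock() and CLOCKS_PER_SEC in place of Time() and mclock(), it is named delay_sketch, and because the standard clock() measures processor time it only approximates wall-clock time on a shared machine.

```c
#include <assert.h>
#include <time.h>

/* Busy-wait until at least 'seconds' of processor time have elapsed. */
void delay_sketch(float seconds)
{
    clock_t start, wait_ticks;

    assert(seconds >= 0.0f);
    start = clock();
    /* Round up by one tick so that truncation can never shorten the wait. */
    wait_ticks = (clock_t)((double)seconds * CLOCKS_PER_SEC) + 1;
    while (clock() - start < wait_ticks)
        ;   /* spin */
}
```

A call such as delay_sketch(1.25f) corresponds to the delay(1.25) example given in the header.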


/* 


EOF  timing. h 


*/ 
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E.   GAUSS  FACTORIZATION  CODE 

The Gauss factorization code appears on the pages that follow.  First, the code
for partial pivoting is given.  Because the complete pivoting case is very similar,
most of it has been omitted to save space.  The pivot election function, however,
is shown in a fragment of gfpcnode.c, the node code for GF with Pivoting (Complete).


#  PURPOSE:  Makefile for Hypercube Gauss Factorization (GF) Program
#  AUTHOR:   Jonathan E. Hartman, U. S. Naval Postgraduate School
#  DATE:     26 August 1991


ROOTCODE=gfpphost
NODECODE=gfppnode
HEADER=gf
NIF_FILE=gfpp


#  OPTIONS  AND  DEFINITIONS 

# 

#  iPSC/2  Section  (MDIR  ==  MatLib  directory) 

MDIR=/usr/hartman/matlib/ 


#  Transputer Section
#
#  The following section establishes options and definitions, starting
#  with PP, the Logical Systems C Preprocessor.  The '-dX' option (with no
#  macro_expression) is like '#define X 1'.  Next the compilation options
#  for Logical Systems' TCX Transputer C Compiler are given.  The '-c'
#  means compress the output file.  The options beginning with '-p' tell
#  TCX to generate code for the appropriate processor:
#
#       -p2     T212 or T222
#       -p25    T225
#       -p4     T414
#       -p45    T400 or T425
#       -p8     T800
#       -p85    T801 or T805
#
#  Logical Systems' TASM Transputer Assembler is next.  The '-c' means
#  compress the output file (it can cut it in half)!  The '-t' is used
#  because the input to TASM will be from a language translator (TCX's
#  output) and not from assembly source code.
#
#  The final list tells TLNK which libraries to look at during linking.
#  It also establishes an entry point.  We use '_main' for the root node
#  and '_ns_main' for other nodes.


PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800
TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8
TASMOPT=-ct
T2LIB=t2lib.tll
T4LIB=matlib4.tll t4lib.tll
T8LIB=matlib8.tll t8lib.tll
RENTRY=_main
NENTRY=_ns_main

60 

#  DEFAULT  ===>  MAKE ALL
#
#  Comment out one or the other....
#
#  all:    ipsc
#  run:    irun
#  clean:  iclean
all:    transputer
run:    trun
clean:  tclean


#  ROOT CODE
#
#  iPSC/2 Section

ipsc:   $(ROOTCODE) $(NODECODE)

$(ROOTCODE):    $(ROOTCODE).o
	cc $(ROOTCODE).o $(MDIR)allocate.o $(MDIR)clargs.o $(MDIR)commhost.o $(MDIR)generate.o \
	   $(MDIR)epsilon.o $(MDIR)io.o $(MDIR)mathx.o $(MDIR)ops.o $(MDIR)timing.o -lm -host \
	   -o $(ROOTCODE)

$(ROOTCODE).o:  $(ROOTCODE).c $(HEADER).h


#  Transputer Section

transputer: $(ROOTCODE).tld $(NODECODE).tld

$(ROOTCODE).tld:  $(ROOTCODE).trl
	echo FLAG    c               >  $(ROOTCODE).lnk
	echo LIST    $(ROOTCODE).map >> $(ROOTCODE).lnk
	echo INPUT   $(ROOTCODE).trl >> $(ROOTCODE).lnk
	echo ENTRY   $(RENTRY)       >> $(ROOTCODE).lnk
	echo LIBRARY $(T4LIB)        >> $(ROOTCODE).lnk
	tlnk $(ROOTCODE).lnk

$(ROOTCODE).trl:  $(ROOTCODE).tal
	tasm $(ROOTCODE).tal $(TASMOPT)

$(ROOTCODE).tal:  $(ROOTCODE).pp
	tcx $(ROOTCODE).pp $(TCXOPT4)

$(ROOTCODE).pp:  $(ROOTCODE).c
	pp $(ROOTCODE).c $(PPOPT4)


#  NODE CODE
#
#  iPSC/2 Section

$(NODECODE):  $(NODECODE).o
	cc $(NODECODE).o $(MDIR)allocate.o $(MDIR)commnode.o $(MDIR)generate.o $(MDIR)io.o \
	   $(MDIR)mathx.o $(MDIR)ops.o $(MDIR)timing.o -node -lm -o $(NODECODE)

$(NODECODE).o:  $(NODECODE).c $(HEADER).h


#  Transputer Section

$(NODECODE).tld:  $(NODECODE).trl
	echo FLAG    c               >  $(NODECODE).lnk
	echo LIST    $(NODECODE).map >> $(NODECODE).lnk
	echo INPUT   $(NODECODE).trl >> $(NODECODE).lnk
	echo ENTRY   $(NENTRY)       >> $(NODECODE).lnk
	echo LIBRARY $(T8LIB)        >> $(NODECODE).lnk
	tlnk $(NODECODE).lnk

$(NODECODE).trl:  $(NODECODE).tal
	tasm $(NODECODE).tal $(TASMOPT)

$(NODECODE).tal:  $(NODECODE).pp
	tcx $(NODECODE).pp $(TCXOPT8)

$(NODECODE).pp:  $(NODECODE).c
	pp $(NODECODE).c $(PPOPT8)


#  EXECUTION
#

irun:  $(ROOTCODE) $(NODECODE)
	$(ROOTCODE)

trun:  $(ROOTCODE).tld $(NODECODE).tld $(NIF_FILE).nif
	echo makecube first
	ld-net $(NIF_FILE) -t -v


#  CLEAN UP
#

iclean:
	rm $(NODECODE).o
	rm $(ROOTCODE).o
	rm $(NODECODE)
	rm $(ROOTCODE)

tclean:
	del $(ROOTCODE).lnk
	del $(NODECODE).lnk
	del $(ROOTCODE).map
	del $(NODECODE).map
	del $(ROOTCODE).tal
	del $(NODECODE).tal
	del $(ROOTCODE).pp
	del $(NODECODE).pp
	del $(ROOTCODE).trl
	del $(NODECODE).trl


#  EOF gfpp.mak


;   ==========   NETWORK INFORMATION FILE   ==========
;
;   SOURCE:   gfpp.nif
;   VERSION:  1.0
;   DATE:     14 September 1991
;   AUTHOR:   Jonathan E. Hartman, U. S. Naval Postgraduate School
;   USAGE:    ld-net gfpp
;
;
;   ==============   REFERENCES   ================
;
;   [1] Inmos.  IMS B012 User Guide and Reference Manual.  Inmos Limited,
;       1988, Fig. 26, p. 28.
;
;
;   ==============   DESCRIPTION   ===============
;
;   Network Information File (NIF) used by Logical Systems C (version 89.1)
;   LD-NET Network Loader.  This file prescribes the loading action to take
;   place when the 'ld-net' command is given as in USAGE above.
;
;
;   =========   HARDWARE PREREQUISITES   =========
;
;   NOTE:  There are three node numbering systems: the one created by Inmos'
;   CHECK program, the Gray code labeling, and the NIF labeling.  Since all
;   three will be used on occasion, I will prefix node numbers with a C, G,
;   or N to identify which system I am using!
;
;   The IMS B004 and IMS B012 must be configured correctly.  The B004's T414
;   has link 0 connected to the host PC via a serial-to-parallel converter,
;   link 1 connected to the IMS B012 PipeHead, link 2 connected to the T212
;   [communications manager (not used here)] on the B012, and link 3
;   connected to the IMS B012 PipeTail (see [1]).  By the way, link 2 from
;   the B004 goes to the ConfigUp slot just under the PipeHead slot
;   (this connects it to the T212).  Finally, the B004's Down link must run
;   to the B012's Up link.
;
;
;   ====   SETTING THE C004 CROSSBAR SWITCHES   ====
;
;   Once you have connected the hardware in the fashion mentioned above,
;   the system is ready to be transformed to a hypercube.  Three codes by
;   Mike Esposito are used here:  t2.nif, root.tld, and switch.tld.  I have
;   a batch file called 'makecube.bat' that performs a 'ld-net t2' also.
;
;   Mike's code passes instructions to the T212 on the B012, which, in turn,
;   tells the C004's how to connect their switches.  After the code has
;   executed, the (very specific) configuration that we are looking for

;   will exist.  Specifically, the following (output from CHECK /R) is what
;   this process gives us:
;
;   check 1.21
;    #  Part       rate  Mb  Bt  [  Link0  Link1  Link2  Link3  ]
;    0  T414b-15   0.09       0  [  HOST   1:1    2:1    3:2    ]
;    1  T800c-20   0.80       1  [  4:3    0:1    5:1    6:0    ]
;    2  T2   -17   0.49       1  [  C004   0:2           C004   ]
;    3  T800c-20   0.80       2  [  7:3    8:2    0:3    9:0    ]
;    4  T800c-20   0.76       3  [  9:3    10:2   11:1   1:0    ]
;    5  T800d-20   0.90       1  [  8:3    1:2    10:1   12:0   ]
;    6  T800d-20   0.76       0  [  1:3    12:2   7:1    11:0   ]
;    7  T800d-20   0.76       3  [  13:3   6:2    14:1   3:0    ]
;    8  T800d-20   0.90       2  [  14:3   15:2   3:1    5:0    ]
;    9  T800c-20   0.77       0  [  3:3    13:2   15:1   4:0    ]
;   10  T800d-20   0.90       2  [  16:3   5:2    4:1    15:0   ]
;   11  T800d-20   0.90       1  [  6:3    4:2    16:1   13:0   ]
;   12  T800d-20   0.77       0  [  5:3    16:2   6:1    14:0   ]
;   13  T800d-20   0.77       3  [  11:3   17:2   9:1    7:0    ]
;   14  T800c-20   0.90       1  [  12:3   7:2    17:1   8:0    ]
;   15  T800c-20   0.90       2  [  10:3   9:2    8:1    17:0   ]
;   16  T800c-20   0.76       3  [  17:3   11:2   12:1   10:0   ]
;   17  T800d-20   0.88       2  [  15:3   14:2   13:1   16:0   ]
;

;   Here node C0 is the root transputer (on the IMS B004) and node C2 is
;   the T212 (on the IMS B012).  The other sixteen nodes are the T800's
;   that are used for the work.  A logical interconnection topology is
;   described below.
;
;
;   ================   TOPOLOGY   ================
;
;   The physical interconnection scheme described above is an actual 4-cube
;   with one exception.  The root node (C0) is situated BETWEEN nodes C1
;   and C3 (which would be connected directly in the usual 4-cube).  This
;   gives us two 3-cubes: one whose node labeling is G0xxx and the other,
;   whose node labeling is G1xxx (where the xxx represents all permutations
;   of 3 bits).  These are the usual 3-cubes, and they will exist if we
;   define the node numbering/labeling correctly.


;   ================   STRATEGY   ================
;
;   The node labeling established by the NIF is available via the variable
;   _node_number (see <conc.h>) in source code.  Therefore, we would like a
;   smart labeling scheme in the NIF file so that programming is easier.
;   This, of course, is subject to the restriction that NIF labels begin
;   with N1 and so on.
;
;   One such method would be to define a NIF labeling so that the Gray code
;   label for a node would be (_node_number - 2).  In fact, this is
;   possible and the adjacencies defined below allow us to realize this
;   feature.  Below, node N0 is the host PC, node N1 is the root transputer
;   (T414 on the B004), N2 through N17 correspond to G0 through G15 (the
;   nodes of a 4-cube), and N18 is not used (but it's the T212).
;
host_server cio.exe;    (default)

;   NODE  TRANSPUTER     RESET    DESCRIPTION OF LINK CONNECTIONS
;    ID   CODE (.tld)    COMES
;                        FROM:    LINK0   LINK1   LINK2   LINK3
;   ====  ===========    =====    =====   =====   =====   =====
     1,   gfpphost,      r0,      0,      2,      ,       10;     B004
     2,   gfppnode,      r1,      4,      1,      3,      6;      B012
     3,   gfppnode,      r2,      11,     2,      5,      7;
     4,   gfppnode,      r5,      12,     5,      8,      2;
     5,   gfppnode,      r3,      9,      3,      4,      13;
     6,   gfppnode,      r7,      2,      7,      14,     8;
     7,   gfppnode,      r9,      3,      9,      6,      15;
     8,   gfppnode,      r4,      6,      4,      9,      16;
     9,   gfppnode,      r8,      17,     8,      7,      5;
    10,   gfppnode,      r11,     14,     11,     1,      12;
    11,   gfppnode,      r13,     15,     13,     10,     3;
    12,   gfppnode,      r16,     10,     16,     13,     4;
    13,   gfppnode,      r12,     5,      12,     11,     17;
    14,   gfppnode,      r6,      16,     6,      15,     10;
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PROGRAM   INFORMATION 
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gf.h 

VERSION 

2.5 

DATE 
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AUTHOR 
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Hartman,  U.  S 

SEE  ALSO 

gfpc.mak 

makefile 

gfpp.mak 

makefile 
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host  code 
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node  code 
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node  code 

for  the  complete  pivoting  case 
for  the  partial  pivoting  case 
for  the  complete  pivoting  case 
for  the  partial  pivoting  case 
for  the  complete  pivoting  case 
for  the  partial  pivoting  case 


[1]  Gragg, William B.  MATLAB code and personal conversations, 1991.


 * ==========   DESCRIPTION   ==========
 *
 * This header file is shared by several programs (listed above).  Each of
 * these codes has something to do with a parallel implementation of Gauss
 * Factorization (GF).  Several pivoting strategies are supported.  Files
 * like gfpc*.* represent a COMPLETE pivoting strategy, and the files like
 * gfpp*.* give the corresponding code for the PARTIAL pivoting scheme.
 *
 * The basic algorithm is from [1].  Parallelism is sought by distributing
 * the columns of A across the nodes of a multiprocessor system (using the
 * hypercube interconnection topology).  The program is designed for the
 * Intel iPSC/2 or a network of Inmos transputers.
 *
 * The algorithm factors Q'AP = LU with P and Q permutation matrices, L
 * unit lower trapezoidal (r columns) and U upper trapezoidal with nonzero
 * diagonal elements (r rows).  The program is designed for a general
 * matrix, A.  It does not assume A square or sparse.  There is no effort
 * to optimize for this, or any other, special structure.  There is one
 * caveat:  I designed the code to gather data for square matrices of full
 * rank.  Therefore, I have tested the square case of random matrices very
 * carefully.  While the code should work for any general matrix, it has
 * not been carefully tested in other cases.  Additionally, since I sought
 * timing data for matrices of full rank, I have NOT addressed the problem
 * of gathering columns (back to the host) to the right of the final pivot
 * for rank-deficient matrices.  This would not be a difficult task, but I
 * did not make this effort since it has no bearing on my goal.
 *
 * In the partial pivoting code, the search for pivots is carried out only
 * in the pivot column, so P is the identity (i.e., there are no column
 * interchanges).  Many of the remaining comments pertain to the complete
 * pivoting case, since it is the most challenging.  The changes for the
 * partial pivoting case should be evident in most cases.  At times, when
 * the changes are not necessarily evident, clarifying remarks address the
 * partial pivoting scheme.  This header file contains the majority of the
 * background and algorithm information, but if you're after a careful
 * study of the differences, compare the source codes.  The algorithm below
 * gives a road map through the code.
 *
 */

63 

/* ==========   ALGORITHM: BACKGROUND   ==========
 *
 * 1.) Preliminaries.  Consider A (m x n), a matrix of real numbers.  The
 * permutation vectors, p and q, characterize column and row permutations
 * (respectively).  The scalar, (g/a), is the growth factor.  The integer,
 * r, is a fairly reasonable determination of the 'numerical rank' of A.
 * The C language convention is followed, numbering rows and columns from
 * zero; and storing dynamic, two-dimensional arrays (matrices) in
 * row-major order.  The 'pivot' will be that element located at A(k,k).  The
 * area (in A) below and to the right of the pivot [all A(i,j) where i > k
 * and j > k ] is called the 'Gauss transform area'.
 *
 * 2.) Communications and Coordination.  Let N be the number of processors
 * (workers) in the hypercube.  These nodes are labeled with a Gray code
 * { 0 .. (N - 1) }.  The root (host) node distributes the columns of A to
 * the nodes.  This is done cyclically, using the C modulus operator ('%').
 * That is, column j will be sent to processor (j mod N).  Once the nodes
 * have their columns, they begin work.  Communication (for the complete
 * pivoting case) involves an election process for the next pivot, where
 * each of the nodes finds its best candidate and then the election finds
 * the best candidate in the global picture.  This is done in lg(N) steps
 * using the cubecast_from() function.
 *
 * The partial pivoting case does not require the election process that
 * complete pivoting needs, but both methods look similar (in terms of
 * communication) after the elections are complete.  The node holding the
 * pivot column must perform the pivot column arithmetic and distribute
 * the resulting pivot column (also in lg(N) steps) to the other nodes.
 * Communications functions are not explained much in this code, but
 * details can be found in the files comm.h & comm.c.
 *
 * 3.) Pivoting Strategy.  The complete pivoting strategy's election
 * process (at each stage) determines the element in (the entire Gauss
 * transform area of) A that is largest in absolute value.  This element
 * wins the election and is 'moved' to A(k,k) for the upcoming stage.  It
 * isn't really moved... but p and q are updated so that we can keep track
 * of permutations.  During the search for the new pivot, candidates are
 * denoted A(s,t) = u.  The largest of the candidates is installed as the
 * next pivot.  There seems to be too much overhead associated with this
 * fancy indexing off of p[] and q[].  For the partial pivoting code, I
 * chose to ACTUALLY SWAP rows (if necessary) at each stage.  This makes
 * the 'pp' code a bit easier to read.
 *
 * 4.) Stopping.  The GF process is repeated until one of two criteria is
 * satisfied.  First, of course, we may run out of matrix.  Secondly, we
 * may find a pivot whose absolute value is less than our tolerance (tol).
 * In the latter case, we have a rank-deficient A.  Currently, the codes
 * recognize rank-deficiency and bail out of the iteration loop; but they
 * do not gather (to the host) all of the remaining columns to the right
 * of the last pivot.  This is discussed above.
 *
 */
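
The cyclic column distribution and the pivot election described in items 2 and 3 can be illustrated with a single-process simulation. The sketch below is not the thesis code and does not use cubecast_from(), whose implementation is not shown here; it assumes a pairwise exchange along each hypercube dimension, with every simulated node keeping the better of its own candidate and its partner's. The names Candidate, owner_of_column, better, and elect_pivot, the tie-breaking rule, and the choice of an 8-node cube are all assumptions made for the illustration.

```c
#include <assert.h>

#define DIM    3                /* hypercube dimension (assumed)        */
#define NODES  (1 << DIM)       /* N = 2^DIM simulated worker nodes     */

typedef struct {
    double u;                   /* candidate value A(s,t)               */
    int s, t;                   /* row and column; s < 0 means "none"   */
} Candidate;

static double abs_value(double x) { return x < 0.0 ? -x : x; }

/* Cyclic distribution: column j is held by node (j mod N). */
int owner_of_column(int j) { return j % NODES; }

/* Symmetric choice between two candidates, so both partners of an
 * exchange keep the same winner.  Ties go to the smaller column, then
 * the smaller row. */
static Candidate better(Candidate a, Candidate b)
{
    if (a.s < 0) return b;
    if (b.s < 0) return a;
    if (abs_value(a.u) != abs_value(b.u))
        return abs_value(a.u) > abs_value(b.u) ? a : b;
    if (a.t != b.t) return a.t < b.t ? a : b;
    return a.s <= b.s ? a : b;
}

/* Elect the pivot for the area i > k, j > k of the m x n row-major A. */
Candidate elect_pivot(const double *A, int m, int n, int k)
{
    Candidate cand[NODES], next[NODES];
    int node, d, i, j;

    for (node = 0; node < NODES; node++) {
        cand[node].u = 0.0;
        cand[node].s = cand[node].t = -1;
    }
    /* Local search: each node examines only the columns it owns. */
    for (j = k + 1; j < n; j++) {
        node = owner_of_column(j);
        for (i = k + 1; i < m; i++) {
            Candidate c;
            c.u = A[i * n + j];
            c.s = i;
            c.t = j;
            cand[node] = better(cand[node], c);
        }
    }
    /* Election: lg(N) exchange steps, one per hypercube dimension. */
    for (d = 0; d < DIM; d++) {
        for (node = 0; node < NODES; node++)
            next[node] = better(cand[node], cand[node ^ (1 << d)]);
        for (node = 0; node < NODES; node++)
            cand[node] = next[node];
    }
    /* After the last step every node holds the same winner. */
    for (node = 1; node < NODES; node++)
        assert(cand[node].s == cand[0].s && cand[node].t == cand[0].t);
    return cand[0];
}
```

The symmetric tie-breaking rule matters in this style of exchange: if the two partners could resolve a tie differently, the nodes would not all finish with the same pivot position.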


/* ==========   ALGORITHM: THE GF PROCESS   ==========
 *
 * 0.) Initialization.  Let dim be the dimension of the hypercube.  Let
 * k = 0.  Search A and find the largest (in absolute value) element, u.
 * This is done at each node.  Once each node has a local candidate for
 * the next pivot, an election is held, dimension-by-dimension.  This
 * requires (dim) steps, and when it is finished, every processor knows
 * exactly the position and value of the next pivot.  Exception:  In the
 * partial pivoting code, the processor which has the pivot column simply
 * searches the (proper part of the) pivot column for the next pivot and
 * then informs the other processors.
 *
 * 1.) Status.  Every node knows the position and value of the next pivot,
 * namely u = A(s,t); and where it should be installed, A(k,k).  The growth
 * rate is adjusted:  g = max[g, abs(u)].  If (abs(u) < tol), then A is
 * rank-deficient and we exit the loop (using the C 'break' statement).
 *
 * 2.) Permutations.  We account for the interchange of rows s and k and
 * columns t and k by swapping the elements of p[] that are indexed by k
 * and t and swapping the elements in q[] indexed by k and s.  This
 * (effectively) establishes the new pivot at A(k,k).  The column
 * permutation vector has no significance in the partial pivoting case
 * since it would never be changed.  The matrix, P, in this case, is simply
 * the identity.
 *
 * 3.) Adjust the Gauss Transform Area.
 *
 *     (a) In the (single) node that holds the new pivot's column (k),
 *     divide every element below the pivot by the pivot value.  Broadcast
 *     this column to every other node.  Node 0 updates the manager, who
 *     uses this information to append to his copy of the resulting
 *     (factored) A.
 *
 *     (b) Now every worker has the updated column k.  At every node, do
 *     the following:  For every element A(i,j) [ where i > k and j > k ]
 *     let A(i,j) = A(i,j) - (A(i,k) * A(k,j)).
 *
 * 4.) Pivot Search.  In the Gauss transform area, G, search for the
 * element that is largest in absolute value.  Its position is A(s,t) and
 * its value is u.  The candidates are chosen at the local (processor)
 * level, then an election is held at the global level to determine the
 * best candidate in the same manner that was described in step 0.
 * Increment k.  Repeat the process (go back to step 1).  The obvious
 * exceptions apply to the partial pivoting case.
 *
 */
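
The steps above can be sketched serially for the partial pivoting case. The function below is an illustrative, single-processor sketch and not the parallel thesis code: the name gf_partial_sketch, the row-major array argument, and the requirement that tol be strictly positive are assumptions. Rows are actually swapped, as the text says the 'pp' code does, and q[] records which original row ends up in each position.

```c
#include <assert.h>
#include <stddef.h>

static double abs_value(double x) { return x < 0.0 ? -x : x; }

/* Factor the m x n row-major matrix A in place with partial pivoting.
 * On return the strict lower trapezoid holds the multipliers (L without
 * its unit diagonal) and the upper trapezoid holds U.  q[i] is the index
 * of the original row now stored in row i.  Returns the number of stages
 * completed, which serves as the numerical rank r. */
int gf_partial_sketch(double *A, int m, int n, int *q, double tol)
{
    int steps = (m < n) ? m : n;
    int i, j, k, s;

    assert(A != NULL && q != NULL && tol > 0.0);
    for (i = 0; i < m; i++)
        q[i] = i;

    for (k = 0; k < steps; k++) {
        /* Pivot search: largest |A(i,k)| for i >= k. */
        s = k;
        for (i = k + 1; i < m; i++)
            if (abs_value(A[i * n + k]) > abs_value(A[s * n + k]))
                s = i;

        /* Status: a tiny pivot means A is rank-deficient. */
        if (abs_value(A[s * n + k]) < tol)
            break;

        /* Permutations: swap rows s and k and record the interchange. */
        if (s != k) {
            int ti = q[s];
            q[s] = q[k];
            q[k] = ti;
            for (j = 0; j < n; j++) {
                double td = A[s * n + j];
                A[s * n + j] = A[k * n + j];
                A[k * n + j] = td;
            }
        }

        /* (a) Divide the pivot column below the pivot by the pivot. */
        for (i = k + 1; i < m; i++)
            A[i * n + k] /= A[k * n + k];

        /* (b) Update the Gauss transform area. */
        for (i = k + 1; i < m; i++)
            for (j = k + 1; j < n; j++)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
    }
    return k;
}
```

In the parallel version the inner update (b) is the part that each node applies only to the columns it holds, after receiving the scaled pivot column.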


/* ==========   NOTES FOR IMPROVEMENT   ==========
 *
 * Currently the code does not give full support for rank-deficiency.  It
 * DOES break out of the loop, but everything to the right of the final
 * pivot column will be garbage.  It would be relatively easy to add the
 * necessary post-iteration rank-deficiency check and coalesce each of the
 * remaining columns back to the manager, but this code was created to
 * test the full-rank cases and take performance data.
 *
 * Secondly, there is the issue of whether it is better for the manager to
 * receive each pivot column as it becomes available, or if all columns
 * should be sent in at the end.  I'm not yet sure which method is better,
 * but the current code keeps the root node up-to-date at each stage.  This
 * is probably the best solution to the problem above and would probably
 * enhance performance during the iterations!  It REALLY SHOULD BE TESTED!
 *
 * There are many other questions that pertain to optimization that remain
 * unanswered (especially in the complete pivoting case).
 *
 */


ALGORITHM:  C0BCLUSI0H  ==== 


1.)  Rank.   Set  r,  the  rank  of  A,  equal  to  the  number  of  iterations  that 

were  executed.  This  is  automatic  in  the  manager  (host)  code  since 
the  integer,  r,  is  used  as  the  loop  index.  The  worker  nodes  use  k  for 
a  loop  index  variable. 

2.)  Interchanges.  Row and column interchanges are not actually done in
the complete pivoting code.  Instead, we maintain permutation vectors,
p[] and q[].  You may note that while both vectors are used heavily
during the GF process, q[], in particular, comes in handy at the end to
set A in order.  The partial pivoting code performs the actual
interchanges of rows.  At first, we would be inclined to believe that the
indexing by p[] and q[] leads to better performance, but there is no
clear timing evidence (at this point) that supports this idea.

3.)  Factors.  The upper trapezoidal matrix, U, is the upper trapezoid
of (the resulting, factored) A (the diagonal of A and everything above
that).  The lower trapezoidal matrix, L, is formed by placing ones on
the diagonal of A; zeros above; and copying the lower trapezoid of A
(excluding the diagonal).  To form Q'AP, we use THE ORIGINAL copy of A
(not the factored, resulting A) and the matrices Q and P that are
implied by q[] and p[].  That is, in the end, we set Q[q[i]][i] = 1.0
for all i in { 0, 1, ..., (m-1) } and set P[p[j]][j] = 1.0 for all j
in { 0, 1, ..., (n-1) }.


Section 1:  Communications Aids (Message Types and Type Selectors)

The following manifest constants simplify the communications effort.
The TRANSPUTER section is fairly general in nature.  The iPSC/2 section
specifies types and type selectors for csend() and crecv().  It IS
SIGNIFICANT that NODE_OFFSET is the largest of these.  It must remain
the largest so that (for all nodes n) the value of (n + NODE_OFFSET)
cannot be equal to one of the other message types (consider n == 0).


/* ------------------------  MANIFEST CONSTANTS  ------------------------ */

#ifdef TRANSPUTER

#define CUBESIZE   8        /* change these for a cube of other dim     */
#define DIMENSION  3

#else  /* iPSC/2 */

#define ARG_TYPE            /* for passing command line argument info   */
#define COL_SIZE_TYPE       /* for sending n part of size(A) ==> cols   */
#define COL_TYPE            /* use this to send a column                */
#define PIVOT_TYPE          /* candidate for next pivot                 */
#define PCOL_TYPE           /* use this to send a pivot column          */
#define ROW_SIZE_TYPE       /* for sending m part of size(A) ==> rows   */
#define NODE_OFFSET         /* for sending messages from nodes          */

#endif


Section 3:  Timing

The root uses a two-dimensional array where the rows are indexed by the
node numbers and the columns use the following indexing.  The nodes, of
course, only need a one-dimensional array with indexing according to
the following scheme.  There are a total of MAX_EVENTS elements in the
array, and indexing for a specific event is given by START_TIME, SETUP,
and so on.  The partial pivoting case does not use all of the events.


#define MAX_EVENTS      18  /* number of events that we want to time    */

#define DATA_SOURCE      0  /* node number of source of the data        */
#define START_TIME       1  /* t(0) ==> starting time for the node      */
#define SETUP            2  /* from t(0) until starting to receive cols */
#define DISTRIB_COLS     3  /* time to distribute columns               */
#define FIRSTPIVOT       4  /* from receipt of last col to start iter   */

/*  The next two only apply to nodes zero and eight  */

#define PCOLS_TO_HOST    5  /* time spent passing pivot cols to host    */
#define PIVOTS_TO_HOST   6  /* time spent passing pivots to host        */

/*  The next five kind of represent the big picture  */

#define PIVOT_ELECTION   7  /* time spent on pivot elections            */
#define UPDATING_PQ      8  /* time spent updating permutations p and q */
#define PCOL_ARITHMETIC  9  /* time spent on pivot column arithmetic    */
#define PCOL_DISTRIB    10  /* time spent distributing pivot columns    */
#define UPDATING_G      11  /* time spent updating the Gauss transform  */

/*  The next four are times from within update_G()  */

#define PRLTIME         12  /* pivot row location time                  */
#define LCTIME          13  /* time to determine if a column is local   */
#define G_ARITHMETIC    14  /* time spent on arithmetic within G        */
#define LOOPTIME        15  /* time for both for() loops in update_G()  */

/*  The last two are back at the big picture level again  */

#define ITERATION       16  /* time checked before and after iteration  */
#define STOP            17  /* the last time sampled by the node        */

/*
 *
 *     Section 4:  General
 *
 *
 */

#define AFT    4   /* number of digits to print after decimal   */
#define WIDTH  6   /* number of characters (including decimal)  */


/*
 *
 *  Section 5:  A special flag used for the id field of a pivot.  When it
 *  appears, it indicates that the sending node's part of A has
 *  no elements as big as the tolerance, tol; and therefore this node's
 *  candidate for pivot should not be considered.
 *
 *
 */

#define RANK_DEFICIENT  -1


/* ===========        TYPE     DEFINITIONS        =========== */

typedef struct {

    int     id;
    double  u;
    int     s,
            t;

} Pivot_Type;


/* ===============       EOF  gf.h       =============== */
/* ------------------------  PROGRAM INFORMATION  ------------------------
 *
 *  SOURCE   :  gfpphost.c
 *  VERSION  :  2.0
 *  DATE     :  21 September 1991
 *  AUTHOR   :  Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ----------------------------  DESCRIPTION  ----------------------------
 *
 *  Gauss Factorization (GF) with Partial Pivoting:  Parallel Version.
 *  This is the manager portion of the code.  See [gf.h] for details.
 *
 * -----------------------------------------------------------------------
 */

#include <stdio.h>
#include <string.h>

#ifdef TRANSPUTER

#include <conc.h>
#include <stdlib.h>                 /*  addfree(), _heapend  */

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <clargs.h>
#include <comm.h>
#include <epsilon.h>
#include <generate.h>
#include <io.h>
#include <ops.h>
#include <timing.h>

#else   /* iPSC/2 */

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/clargs.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/epsilon.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/io.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"

#endif

#include "gf.h"

/* ------------------------  MANIFEST CONSTANTS  ------------------------
 *
 *  The following manifest constants are used to determine the size of the
 *  option list, optv[]; indexing associated with valid command line
 *  arguments; and selection constants for the user's choice of matrix type
 *  [used in generate()].
 *
 */

#define NUMBER_OF_ARGS    3         /* -d -t -v                 */

#define DIM               0         /* index into optv[]        */
#define TIMING            1         /*   "     "     "          */
#define VERBOSE           2         /*   "     "     "          */

#define SELECT_QUIT       0         /* menu / matrix selection  */
#define SELECT_IDENTITY   1
#define SELECT_HILBERT    2
#define SELECT_RANDOM     3
#define SELECT_WILKINSON  4


/* ------------------------------  GLOBALS  ------------------------------ */

static char version[] = "Parallel GF with Partial Pivoting, Version 2.0";

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#else   /* iPSC/2 */

static char *cubename;

static char *nodecode = "gfppnode";

#endif  /* TRANSPUTER */

static Arg_Struct *optv[NUMBER_OF_ARGS];


/* ===========        FUNCTION     DEFINITION        ===========
 *
 *  The structure is defined more carefully in clargs.h, but the basic idea
 *  is that we have an array of pointers to type Arg_Struct... in this case,
 *  there are NUMBER_OF_ARGS valid arguments and the next few steps take
 *  care of allocation and definition of them.  The -d argument allows the
 *  user to enter the desired dimension of the hypercube, -t sets timing on
 *  and -v is used to set verbose on.
 */

void define_valid_args() {

    static int interpret[] = { LONG };


    install_complex_arg(DIM, optv, "-d", interpret, 1);

    install_simple_arg(TIMING,  optv, "-t");
    install_simple_arg(VERBOSE, optv, "-v");

}
/*  End define_valid_args()  */


/* ===========        FUNCTION     DEFINITION        ===========
 *
 *  A simple function to display the results....
 */

#ifdef PROTOTYPE

void display_timing_data(Double_Matrix_Type *A,
                         int                 dim,
                         double              a,
                         double              eps,
                         double              g,
                         double              tol,
                         int                 r,
                         double            **t)

#else

void display_timing_data(A, dim, a, eps, g, tol, r, t)

Double_Matrix_Type *A;
int                 dim;
double              a,
                    eps,
                    g,
                    tol;
int                 r;
double            **t;

#endif
{
    int  aft,
         cubesize = pow2(dim),
         i,
         m = A->rows,
         n = A->cols,
         width;


#ifdef TRANSPUTER   /* is measured in 64 microsecond ticks ==> 4-5 places */

    aft   = 5;
    width = 15;

#else               /* iPSC/2 is measured in milliseconds ==> three places */

    aft   = 3;
    width = 13;

#endif

    printf("=========     TIMING     DATA     =========");
    printf("\n\n");

    printf("Hypercube of order %d ", dim);
    (dim == 0) ? (printf("(1 processor)\n\n")) :
                 (printf("(%d processors)\n\n", cubesize));

    printf("Problem size ==> size(A) = (%d x %d).\n", m, n);
    printf("Machine precision:   eps = %e\n", eps);
    printf("Tolerance:  tol = %e\n", tol);
    printf("Growth factor:       g/a = %e\n", (g/a));
    printf("Rank:  rank(A) = %3d\n", r);
    printf("Units for timing data:  = seconds\n");

    for (i = 0; i < cubesize; i++) {

        printf("\nNode %2d Data ", i);
        printf("\n\n");

        printf("Setup and initialization:  ");
        printf("%*.*lf", width, aft, t[i][SETUP]);
        printf("\nInitial column distribution:  ");
        printf("%*.*lf", width, aft, t[i][DISTRIB_COLS]);

        if (i == 0) {

            printf("\nTransmission of pivot columns to the host:    ");
            printf("%*.*lf", width, aft, t[i][PCOLS_TO_HOST]);
            printf("\nTransmission of pivots to the host:  ");
            printf("%*.*lf", width, aft, t[i][PIVOTS_TO_HOST]);
        }

        printf("\nPerformance of pivot column arithmetic:       ");
        printf("%*.*lf", width, aft, t[i][PCOL_ARITHMETIC]);
        printf("\nDistribution of pivot columns:  ");
        printf("%*.*lf", width, aft, t[i][PCOL_DISTRIB]);
        printf("\nPerformance of updates and arithmetic in G:    ");
        printf("%*.*lf", width, aft, t[i][UPDATING_G]);
        printf("\nUpdate_G():   loop time including arithmetic:   ");
        printf("%*.*lf", width, aft, t[i][LOOPTIME]);

        printf("\n\nTime for all work inside main iteration loop:  ");
        printf("%*.*lf", width, aft, t[i][ITERATION]);
        printf("\nTotal time from start to stop:  ");
        printf("%*.*lf\n\n", width, aft, (t[i][STOP] - t[i][START_TIME]));
    }

}
/*  End display_timing_data()  */


/* ===========        FUNCTION     DEFINITION        ===========
 *
 *  This function distributes the columns of A to the nodes of the
 *  hypercube.  The loop variable, j, designates each column of A in turn.
 *  The column buffer, cbuf[], copies from A the column to be transmitted.
 *  After cbuf[] is filled, [i = (j mod cubesize)] means that node i will
 *  get column j and the modulus operation seems to be a reasonable and
 *  efficient scheme of distribution.  Finally, the call to send() ships
 *  the column out to the appropriate node.
 */

#ifdef PROTOTYPE

void distribute_columns(Double_Matrix_Type *A, int dim, double *cbuf)

#else

void distribute_columns(A, dim, cbuf)

Double_Matrix_Type *A;
int                 dim;
double             *cbuf;

#endif
{

    int  i,
         j,
         pos = 42,                  /* position of print head        */
         rm  = LINE_LENGTH - 10;    /* right margin (see matrix.h)   */

    long cubesize   = pow2(dim),
         sizeof_col = (long) (A->rows * sizeof(double));


    printf("Distributing the columns of A to the nodes");

    for (j = 0; j < A->cols; j++) {

        for (i = 0; i < A->rows; i++) { cbuf[i] = A->matrix[i][j]; }


        i = j % cubesize;           /* column j --> node i           */

#ifdef TRANSPUTER                   /* node 0 has to sort 'em out    */

        if (i < 8) {

            send(0, (char *) cbuf, sizeof_col, cubesize);
        }
        else {

            send(8, (char *) cbuf, sizeof_col, cubesize);
        }

#else   /* iPSC/2 */

        send(i, (char *) cbuf, sizeof_col, COL_TYPE);

#endif  /* TRANSPUTER */

        printf(".");

        if (pos++ > rm) {

            pos = 0;
            printf("\n");
        }
    }

    printf("\nColumn distribution complete.\n\n");

}
/*  End distribute_columns()  */


/* ===========        FUNCTION     DEFINITION        ===========
 *
 *  This function prompts the user for matrix size and type, then generates
 *  the matrix with a call to a function from generate.c.
 */

#ifdef PROTOTYPE

Double_Matrix_Type *generate(int *m, int *n)

#else

Double_Matrix_Type *generate(m, n)

int  *m,
     *n;

#endif
{
    Double_Matrix_Type *A;

    int  matrix_type,
         valid = FALSE;


    printf("Please enter the number of rows in A:  ");
    scanf("%d", m);
    fflush(stdin);

    printf("\n and the number of columns in A:  ");
    scanf("%d", n);
    fflush(stdin);

    printf("\n\nSelect from the following list of matrices:");

    while (!valid) {

        printf("\n\n");
        printf("    %d.)  QUIT      \n", SELECT_QUIT     );
        printf("    %d.)  Identity  \n", SELECT_IDENTITY );
        printf("    %d.)  Hilbert   \n", SELECT_HILBERT  );
        printf("    %d.)  Random    \n", SELECT_RANDOM   );
        printf("    %d.)  Wilkinson \n", SELECT_WILKINSON);
        printf("\n>");

        scanf("%d", &matrix_type);
        fflush(stdin);

        switch(matrix_type) {

            case SELECT_IDENTITY  :
            case SELECT_HILBERT   :
            case SELECT_RANDOM    :
            case SELECT_WILKINSON :  valid = TRUE;    break;
            case SELECT_QUIT      :  exit(EXIT_SUCCESS);
        }

    }  /* end while() */

    switch(matrix_type) {

        case SELECT_IDENTITY:

            printf("\n\nGenerating A = identity(%d, %d).\n\n", *m, *n);
            A = identity(*m, *n);
            break;

        case SELECT_HILBERT:

            printf("\n\nGenerating A = hilbert(%d, %d).\n\n", *m, *n);
            A = hilbert(*m, *n);
            break;

        case SELECT_RANDOM:

            printf("\n\nGenerating A = mxrand(%d, %d).\n\n", *m, *n);
            A = mxrand(*m, *n);
            break;

        case SELECT_WILKINSON:

            printf("\n\nGenerating A = wilkinson(%d, %d).\n\n", *m, *n);
            A = wilkinson(*m, *n);
            break;
    }

    if (!A) {

        printf("generate():  Allocation failure for the matrix A.\n");
        exit(EXIT_FAILURE);
    }

    return(A);

}
/*  End generate()  */


/* ===========        FUNCTION     DEFINITION        ===========
 *
 *  Collect timing data from the nodes.  The Intel side of this function
 *  takes advantage of the host's ability to receive from any node.  The
 *  transputer side must receive every node's information from nodes zero to
 *  eight (eight only becomes involved in the case of the hybrid 4-cube).
 */

#ifdef PROTOTYPE

double **receive_timing_data(int cubesize)

#else

double **receive_timing_data(cubesize)

int     cubesize;

#endif
{
    double **dt;            /* (double) version of t[][]        */

    int      i,
             j;

    long     tlen;          /* length of one node's data        */

    ticks  **t;             /* raw timing data from nodes       */


    /*
     *  Perform allocation for the timing data t[][].  The two-dimensional
     *  array is indexed by node number for the rows and by event for the
     *  columns.  For instance, t[i][j] means the time required for event
     *  j at node i.  Actually, there is an extra row reserved at the end
     *  of t[][] for totals:  t[cubesize][j] gives the total time for event
     *  j across all nodes.
     */

    if (!(dt = (double **) malloc((cubesize + 1) * sizeof(double *)))) {

        printf("receive_timing_data():  Allocation failure for dt[][].\n");
        exit(EXIT_FAILURE);
    }

    for (i = 0; i < (cubesize + 1); i++) {

        if (!(dt[i] = (double *) calloc(MAX_EVENTS, sizeof(double)))) {

            printf("Host:  Allocation failure for dt[%d].\n", i);
            exit(EXIT_FAILURE);
        }
    }

    if (!(t = (ticks **) malloc((cubesize + 1) * sizeof(ticks *)))) {

        printf("receive_timing_data():  Allocation failure for t[][].\n");
        exit(EXIT_FAILURE);
    }

    for (i = 0; i < (cubesize + 1); i++) {

        if (!(t[i] = (ticks *) calloc(MAX_EVENTS, sizeof(ticks)))) {

            printf("Host:  Allocation failure for t[%d].\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Receiving timing data from the nodes");

    tlen = (long) (MAX_EVENTS * sizeof(ticks));

    for (i = 0; i < cubesize; i++) {

        printf(".");

#ifdef TRANSPUTER

        if (i < 8) receive(0, (char *) t[i], tlen, cubesize);
        else       receive(8, (char *) t[i], tlen, cubesize);

#else   /* iPSC/2 */

        receive(i, (char *) t[i], tlen, (i + NODE_OFFSET));

#endif  /* TRANSPUTER */
    }

    printf("\n\n");


    /*  Calculate totals, averages; place totals in t[cubesize] first....
     *  then copy to dt[][] and record averages in dt[cubesize].
     */

    for (i = 0; i < cubesize; i++) {

        for (j = 0; j < MAX_EVENTS; j++) t[cubesize][j] += t[i][j];
    }

    /*  Fill dt[][] with double values (in seconds).  The conversion
     *  factors are borrowed from timing.h.
     */

    for (i = 0; i <= cubesize; i++) {

        dt[i][DATA_SOURCE] = (double) t[i][DATA_SOURCE];

        for (j = START_TIME; j < MAX_EVENTS; j++) {

#ifdef TRANSPUTER

            dt[i][j] = ((double) t[i][j]) * LO_PERIOD;

#else

            dt[i][j] = ((double) t[i][j]) * M_PERIOD;

#endif
        }
    }

    /*  Convert totals to averages in dt[cubesize]  */

    for (j = START_TIME; j < MAX_EVENTS; j++) {

        dt[cubesize][j] /= ((double) cubesize);
    }


    for (i = 0; i < (cubesize + 1); i++) free(t[i]);
    free(t);

    return(dt);
}
/*  End receive_timing_data()  */


/* ===========        FUNCTION     DEFINITION        ===========
 *
 *  This function analyzes the command line that the user supplied and sets
 *  variables accordingly.  The valid arguments are given by
 *  define_valid_args(), and the real work is passed off to
 *  interpret_args(), from the clargs library.
 */

#ifdef PROTOTYPE

void resolve_args(int argc, char *argv[],
                  int *dim, int *timing, int *verbose)

#else

void resolve_args(argc, argv, dim, timing, verbose)

int    argc;
char  *argv[];
int   *dim,
      *timing,
      *verbose;

#endif
{
    int  maxdim = 3,
         valid  = FALSE;


    interpret_args(argc, argv, NUMBER_OF_ARGS, optv);    /* see clargs.h */

#ifdef TRANSPUTER

    *dim = DIMENSION;

#else   /* iPSC/2 */

    if (optv[DIM]->found) *dim = (int) optv[DIM]->lsa[0];

    switch (*dim) {

        case 0:  case 1:  case 2:  case 3:  break;

        default:  while (!valid) {

            printf("Enter desired cube dimension (0...%d):  ", maxdim);
            scanf("%d", dim);
            fflush(stdin);

            switch(*dim) {
                case 0:  case 1:  case 2:  case 3:
                    valid = TRUE;
                    break;
            }
        }
    }  /* end switch() */

#endif  /* TRANSPUTER */

    (optv[TIMING]->found)  ? (*timing  = TRUE) : (*timing  = FALSE);

    (optv[VERBOSE]->found) ? (*verbose = TRUE) : (*verbose = FALSE);

    printf("Argument resolution complete...\n\n");
    printf("        Cube Dimension:  %d\n", *dim);

    if (*timing)  printf("  Timing:  ON\n");

    (*verbose) ? (printf("  Verbose Mode:  ON\n\n")) :
                 (printf("\n"));

}
/*  End resolve_args()  */


/* ============        FUNCTION     DEFINITION        ============
 *
 */

#ifdef PROTOTYPE

void show_resulting_matrices(Double_Matrix_Type *A,
                             Double_Matrix_Type *A0, int *q)

#else

void show_resulting_matrices(A, A0, q)

Double_Matrix_Type *A,
                   *A0;
int                *q;

#endif
{
    Double_Matrix_Type *D,
                       *L,
                       *LU,
                       *P,
                       *QT,
                       *QTA,
                       *QTAP,
                       *U;

    int  i,
         j,
         m = A->rows,
         n = A->cols;


    printf("Gauss Factorization Complete...\n\n");

    strcpy(A->name, "A  (after GF operations)");


    /*  Allocate and form Q' and P  */

    if (!(QT = matalloc(m,m))) {

        printf("Allocation failure for QT.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(QT->name, "Q Transpose");

    for (i = 0; i < m; i++) {  QT->matrix[i][q[i]] = 1.0;  }


    if (!(P = identity(n,n))) {

        printf("Allocation failure for P.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(P->name, "P  [ Partial (column) Pivoting ==> P == Identity ]");


    /*  Here, we slowly form Q'AP, keeping in mind that the A we are
     *  talking about is the original A.... and we have labeled that one
     *  A0.  Therefore, we first form QTA (Q'A) as Q' * A0.  After we
     *  have QTA, we can multiply it (on the right) by P to get Q'AP,
     *  or QTAP as it is called here.
     */

    if (!(QTA = matalloc(m,n))) {

        printf("Allocation failure for QTA.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(QTA->name, "Q' * (original) A");

    if (matrix_product(QT, A0, QTA) == FAILURE) {

        printf("matrix_product(QTA) Failure.\n");
        exit(EXIT_FAILURE);
    }


    if (!(QTAP = matalloc(m,n))) {

        printf("Allocation failure for QTAP.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(QTAP->name, "Q' * A * P");

    if (matrix_product(QTA, P, QTAP) == FAILURE) {

        printf("matrix_product(QTAP) Failure.\n");
        exit(EXIT_FAILURE);
    }


    /*  Next, we form L and U so that we can compare Q'AP ?=? LU.  */

    L = zeros(m, n);  L->name = "L ";
    U = zeros(m, n);  U->name = "U ";

    for (i = 0; i < A->rows; i++) {

        for (j = 0; j < A->cols; j++) {

            if (i <  j) { U->matrix[i][j] = A->matrix[i][j]; }
            if (i == j) {
                L->matrix[i][j] = 1.0;
                U->matrix[i][j] = A->matrix[i][j];
            }
            if (i >  j) { L->matrix[i][j] = A->matrix[i][j]; }
        }
    }

    if (!(LU = matalloc(m,n))) {

        printf("Allocation failure for LU.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(LU->name, "L * U");

    if (matrix_product(L, U, LU) == FAILURE) {

        printf("matrix_product(LU) Failure.\n");
        exit(EXIT_FAILURE);
    }


    /*  Finally, we create a matrix of differences between the elements
     *  found in QTAP (Q'AP) and LU.  If everything proceeded according
     *  to the plan, this will be a matrix of zeros.
     */

    if (!(D = matalloc(m,n))) {

        printf("Allocation failure for D.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(D->name, "Q'AP - LU");

    for (i = 0; i < m; i++) {

        for (j = 0; j < n; j++) {

            D->matrix[i][j] = (QTAP->matrix[i][j] - LU->matrix[i][j]);
        }
    }

    printmd(*A,    WIDTH, AFT);
    printf("\n\n");
    printmd(*L,    WIDTH, AFT);
    printf("\n\n");
    printmd(*U,    WIDTH, AFT);
    printf("\n\n");

    printmd(*QT,   WIDTH, AFT);
    printf("\n\n");
    printmd(*P,    WIDTH, AFT);
    printf("\n\n");
    printmd(*QTA,  WIDTH, AFT);
    printf("\n\n");
    printmd(*QTAP, WIDTH, AFT);
    printf("\n\n");
    printmd(*LU,   WIDTH, AFT);
    printf("\n\n");
    printmd(*D,    WIDTH, AFT);
    printf("\n\n");

}
/*  End show_resulting_matrices()  */


/* ============        FUNCTION     DEFINITION        ============
 *
 *  This is a simple function to physically swap the elements from row s to
 *  the current pivot row, r.  It does not concern itself with column r or
 *  any column j > r.
 */

#ifdef PROTOTYPE

void swap_rows_left_of_pivot(Double_Matrix_Type *A, int r, int s)

#else

void swap_rows_left_of_pivot(A, r, s)

Double_Matrix_Type *A;
int                 r,
                    s;

#endif
{
    double tmp;

    int    j;


    for (j = 0; j < r; j++) {

        tmp = A->matrix[r][j];
        A->matrix[r][j] = A->matrix[s][j];
        A->matrix[s][j] = tmp;
    }

}
/*  End swap_rows_left_of_pivot()  */


/* ============        FUNCTION     DEFINITION        ============
 *
 *  This function performs updates to a permutation vector, v[], of length
 *  'size'.  The pivot_index indicates the row or column where the next
 *  pivot has been located; and k indicates the stage, or the row and
 *  column where the pivot is to be installed.
 */

#ifdef PROTOTYPE

void update_permutation(int v[], int size, int k, int pivot_index)

#else

void update_permutation(v, size, k, pivot_index)

int  v[],
     size,
     k,
     pivot_index;

#endif
{
    int  i;


    i = v[k];  v[k] = v[pivot_index];  v[pivot_index] = i;
}
/*  End update_permutation()  */


#ifdef PROTOTYPE   /* ================================================= */

main(int argc, char *argv[])

#else

main(argc, argv)

int   argc;
char *argv[];

#endif
{

    /*  VARIABLE DEFINITIONS  */

    double  a,                    /*  denominator of growth factor (g/a)    */
           *cbuf,                 /*  col buffer holds one col at a time    */
          **dtime,                /*  doubles corresponding to ticks **t    */
            eps        = epsd(),  /*  machine precision (see machine.h)     */
            g          = 0.0,     /*  the growth factor                     */
            root_time,            /*  time measured at root for iterations  */
            tol;                  /*  tolerance                             */

    Double_Matrix_Type  *A,       /*  This A gets operated upon/changed     */
                        *A0;      /*  The original copy of A                */

    int   cubesize,               /*  number of processors in the cube      */
          dim,                    /*  dimension of the hypercube            */
          i,
          j,
          m,                      /*  number of rows in A                   */
          me,                     /*  root processor's id                   */
          n,                      /*  number of cols in A                   */
         *q,                      /*  row permutation vector                */
          r,                      /*  numerical rank estimate               */
          timing,                 /*  Boolean                               */
          verbose;                /*  Boolean                               */

    long  sizeof_col,             /*  sizes, in bytes                       */
          sizeof_int,
          sizeof_pivot;

    ticks  root_start,
           t_root,                /*  time measured at root transputer      */
         **t;                     /*  time data: row => node, col => event  */

    Pivot_Type  pivot;            /*  pivot                                 */


    /* ==========  INITIALIZATIONS  ========== */


#ifdef TRANSPUTER

    /*  Add 1M to the heap to allow for generation of large matrices  */
    addfree((void *) _heapend, 0x100000);

#endif

    printf("\n%s\n\n", version);

    define_valid_args();

    resolve_args(argc, argv, &dim, &timing, &verbose);

    A = generate(&m, &n);

    sizeof_col   = (long) (A->rows * sizeof(double));
    sizeof_int   = (long) sizeof(int);
    sizeof_pivot = (long) sizeof(Pivot_Type);

    if (!(cbuf = (double *) malloc(sizeof_col))) {

        printf("main():  Allocation failure for cbuf[].\n");
        exit(EXIT_FAILURE);
    }

    cubesize = POW2(dim);

#ifdef TRANSPUTER

    initialize_hypercube(dim);

#else

    cubename = initialize_hypercube(dim, nodecode);

#endif


    me = myhost();

    if (verbose) {

        if (!(A0 = matalloc(m,n))) {

            printf("Allocation failure for A0.\n");
            exit(EXIT_FAILURE);
        }

        strcpy(A0->name, "Original A");

        for (i = 0; i < A->rows; i++) {
            for (j = 0; j < A->cols; j++) {

                A0->matrix[i][j] = A->matrix[i][j];
            }
        }
        printf("\n\nA has been allocated and generated.\n\n");
        printmd(*A, WIDTH, AFT);
        printf("\n\nSending size(A) to the nodes.\n\n");
    }



#ifdef TRANSPUTER

    cubecast(me, dim, (char *) &m,      sizeof_int, cubesize);
    cubecast(me, dim, (char *) &n,      sizeof_int, cubesize);
    cubecast(me, dim, (char *) &timing, sizeof_int, cubesize);

#else   /*  iPSC/2  */

    cubecast(me, dim, (char *) &m,      sizeof_int, ROW_SIZE_TYPE);
    cubecast(me, dim, (char *) &n,      sizeof_int, COL_SIZE_TYPE);
    cubecast(me, dim, (char *) &timing, sizeof_int, ARG_TYPE);

#endif

    if (verbose) printf("\nSent size(A) to nodes.\n");

    distribute_columns(A, dim, cbuf);

    q = initial_permutation_vector(m);


    /*  FINAL PREPARATIONS BEFORE STARTING THE ITERATION
     *
     *  Get the first pivot from node 0.  Initialize the growth factor
     *  variables, g and a, so that we can compute growth factor (g/a) as
     *  we go.  Set a reasonable tolerance.
     */

#ifdef TRANSPUTER

    receive(0, (char *) &pivot, sizeof_pivot, cubesize);

#else   /*  iPSC/2  */

    receive(0, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif  /*  TRANSPUTER  */


    a = g = MAX(g, fabs(pivot.u));

    tol = (MIN(m,n)) * g * eps;


    /*  BEGINNING OF ITERATION
     *
     *  We enter with A established and knowledge of the first pivot.
     */

#ifdef TRANSPUTER

    root_start = clock();

#endif

    printf("Beginning iterations.\n\n");

    for (r = 0; r < (MIN(m,n)); r++) {

        if (pivot.id == RANK_DEFICIENT) break;

        /*  We expect to receive cbuf[] in the correct (i.e., already
         *  swapped) order.  Before we stuff cbuf[] into A[][], we'll swap
         *  rows left of the pivot column, and then insert the new pivot
         *  column.
         */

#ifdef TRANSPUTER

        receive(0, (char *) cbuf, sizeof_col, cubesize);

#else   /*  iPSC/2  */

        receive(0, (char *) cbuf, sizeof_col, PCOL_TYPE);

#endif  /*  TRANSPUTER  */

        g = MAX(g, fabs(pivot.u));

        update_permutation(q, m, r, pivot.s);

        if (pivot.s != r)  swap_rows_left_of_pivot(A, r, pivot.s);

        for (i = 0; i < A->rows; i++) { A->matrix[i][r] = cbuf[i]; }

        if (verbose) {

            printf("Host:  Stage %d, Pivot value = %e.  ", r, pivot.u);
            printf("Growth factor = %e.\n", (g/a));
            printf("q = ");  printvi(q, A->rows, WIDTH);
            printf("\n");

        }

        if (r < ((MIN(m,n)) - 1)) {

#ifdef TRANSPUTER

            receive(0, (char *) &pivot, sizeof_pivot, cubesize);

#else   /*  iPSC/2  */

            receive(0, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif  /*  TRANSPUTER  */

        }

    }   /*  end for(r)  */


#ifdef TRANSPUTER

    t_root = (clock() - root_start);

    if (timing) {

        root_time = ((double) t_root) * LO_PERIOD;

        printf("\n\nRoot transputer:  ");
        printf("Time for iterations: %8.4lf seconds\n\n", root_time);
    }

#endif


    free(cbuf);


    /*  I have selected the easy way out and assumed A has full rank.  If
     *  you did not make this assumption, you would need to collect the
     *  remaining columns at this point.
     */

    if (timing) dtime = receive_timing_data(cubesize);


    /*  There is no more use for the nodes, so they can be released.  */

#ifndef TRANSPUTER
    printf("\n\nmain():  Killing and releasing cube.\n\n");
    killcube(ALL_NODES, ALL_PIDS);
    relcube(cubename);
#endif

    if (verbose) {   /*  Create and show q', A0, P, L, U ....  */

        show_resulting_matrices(A, A0, q);

    }


    if (timing) display_timing_data(A, dim, a, eps, g, tol, r, dtime);

}
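
The growth factor (g/a) and the tolerance used by the host's main() can be sketched on their own. This is only an illustration under stated assumptions: the pivots are passed in as a plain array rather than received from node 0, eps is a parameter rather than the result of epsd(), and the names demo_abs, demo_growth_factor and demo_tolerance are hypothetical. The guard against a zero first pivot is an addition for the sketch.

```c
/* Absolute value without <math.h>, so the sketch needs no math library. */
static double demo_abs(double x)
{
    return (x < 0.0) ? -x : x;
}

/* a is |first pivot|; g is the running maximum of |pivot| over all stages.
 * The growth factor reported by the host is g / a. */
double demo_growth_factor(const double *pivots, int count)
{
    double a, g;
    int    r;

    if (count <= 0) return 0.0;

    a = g = demo_abs(pivots[0]);
    if (a == 0.0) return 0.0;          /* guard added for this sketch only */

    for (r = 1; r < count; r++) {
        if (demo_abs(pivots[r]) > g) g = demo_abs(pivots[r]);
    }
    return g / a;
}

/* tol = min(m, n) * g * eps, with g taken at the time tol is set. */
double demo_tolerance(int m, int n, double g, double eps)
{
    int k = (m < n) ? m : n;

    return (double) k * g * eps;
}
```

In the listing, tol is computed once before the iteration, so the g it uses is the magnitude of the first pivot rather than the final running maximum.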

/* =============  EOF  gfpphost.c  ============= */

/* ==========  PROGRAM INFORMATION  ==========
 *
 *  SOURCE    gfppnode.c
 *  VERSION   2.0
 *  DATE      21 September 1991
 *  AUTHOR    Jonathan E. Hartman, U. S. Naval Postgraduate School
 *  REMARKS   See gf.h.
 *
 * ============================================
 */


#include <math.h>

#ifdef TRANSPUTER

#include <conc.h>

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <comm.h>
#include <generate.h>
#include <mathx.h>
#include <ops.h>
#include <timing.h>

#else

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/mathx.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"
#endif

#include "gf.h"

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#endif


ticks t[MAX_EVENTS];






/* ===========  FUNCTION DEFINITION  ===========
 *
 *  This function is kind of an inverse for local_column().  Given some
 *  column number (local_column) held at this node, the function returns
 *  the corresponding column number in the global/host copy of the full-
 *  sized A.  This could be implemented more efficiently as a macro.
 */

#ifdef PROTOTYPE

int global_column(int local_column, int me, int cubesize)

#else

int global_column(local_column, me, cubesize)

int local_column,
    me,
    cubesize;

#endif
{
    return(local_column * cubesize + me);
}
/*  End global_column()  */


/* ===========  FUNCTION DEFINITION  ===========
 *
 *  This function maps a column number in the global A (the full-sized A
 *  held at the root processor/host) to the corresponding local column
 *  number.  If the global_column is not one that is held at this node, a
 *  negative value (-1) is returned.
 */

#ifdef PROTOTYPE

int local_column(int global_column, int me, int cubesize)

#else

int local_column(global_column, me, cubesize)

int global_column,
    me,
    cubesize;

#endif
{
    if ((global_column % cubesize) != me)  return(-1);

    return((int) global_column / cubesize);
}
/*  End local_column()  */
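
The two mappings above describe a wrapped (cyclic) assignment of columns to nodes: global column gc lives on node gc % cubesize. A short sketch of the round trip follows, with the demo_ prefix marking the functions as illustrative copies rather than the thesis routines.

```c
/* Local column index -> column index in the full matrix held at the host. */
int demo_global_column(int local_column, int me, int cubesize)
{
    return local_column * cubesize + me;
}

/* Global column index -> local index, or -1 if this node does not hold it. */
int demo_local_column(int global_column, int me, int cubesize)
{
    if ((global_column % cubesize) != me) return -1;

    return global_column / cubesize;
}
```

For example, with cubesize 4, node 1 holds global columns 1, 5, 9, ... as local columns 0, 1, 2, ..., and asking node 1 for global column 10 yields -1.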


/* ===========  FUNCTION DEFINITION  =========== */

#ifdef PROTOTYPE

void do_pivot_column_arithmetic(Double_Matrix_Type *A, double *cbuf,
                                int k, int me, int cubesize)

#else

void do_pivot_column_arithmetic(A, cbuf, k, me, cubesize)

Double_Matrix_Type *A;
double *cbuf;
int k,
    me,
    cubesize;

#endif
{
    double pivot_value;

    int    i,
           pivot_column;


    pivot_column = local_column(k, me, cubesize);
    pivot_value  = A->matrix[k][pivot_column];

    /*  Divide everything under the pivot by the pivot value  */
    for (i = (k+1); i < A->rows; i++) {

        A->matrix[i][pivot_column] /= pivot_value;
    }


    /*  This is somewhat redundant, and not optimal with respect to
     *  efficiency, but it works and reads clearly, right?
     */

    for (i = 0; i < A->rows; i++)  cbuf[i] = A->matrix[i][pivot_column];

}
/*  End do_pivot_column_arithmetic()  */
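
The scaling step can be illustrated on a single column held in a plain array. This sketch assumes the pivot is nonzero (the listing relies on the RANK_DEFICIENT check elsewhere for that); demo_scale_below_pivot is a hypothetical name.

```c
/* Divide every entry below row k by the pivot col[k], turning those
 * entries into the multipliers of the Gauss transform.
 * Assumes col[k] != 0. */
void demo_scale_below_pivot(double *col, int rows, int k)
{
    double pivot_value = col[k];
    int    i;

    for (i = k + 1; i < rows; i++) {
        col[i] /= pivot_value;
    }
}
```

Entries at and above row k are left alone, which is why the listing can copy the whole column into cbuf[] afterwards and broadcast it unchanged.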


/* ===========  FUNCTION DEFINITION  ===========
 *
 *  This function accepts the matrix, the global column number for this
 *  stage (where the pivot will be taken from), and a pivot structure to be
 *  filled ... among other things ... and 'returns' the row, s, and value, u,
 *  of the new pivot in global column r (local column lc).
 */

#ifdef PROTOTYPE

void locate_pivot(int me, int cubesize, Double_Matrix_Type *A, int r,
                  Pivot_Type *pivot)

#else

void locate_pivot(me, cubesize, A, r, pivot)

int me,
    cubesize;
Double_Matrix_Type *A;
int r;
Pivot_Type *pivot;

#endif
{
    int i,
        pivot_column;


    pivot_column = local_column(r, me, cubesize);

    /*  Initialize pivot row and value  */
    pivot->s = r;
    pivot->u = A->matrix[r][pivot_column];


    for (i = (r+1); i < A->rows; i++) {

        if (fabs(A->matrix[i][pivot_column]) > fabs(pivot->u)) {

            pivot->s = i;
            pivot->u = A->matrix[i][pivot_column];
        }
    }
}
/*  End locate_pivot()  */
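
The pivot search above is ordinary partial pivoting: scan the column from row r downward and keep the entry of largest magnitude. A self-contained sketch follows; the Demo_Pivot struct carries only the s and u fields and is an assumption, not the thesis's Pivot_Type.

```c
/* Hypothetical reduced pivot record: row index s and signed value u. */
typedef struct {
    int    s;
    double u;
} Demo_Pivot;

/* Return the row (>= r) holding the largest |value| in col[], together
 * with that value.  Ties keep the earliest row, as in locate_pivot(). */
Demo_Pivot demo_locate_pivot(const double *col, int rows, int r)
{
    Demo_Pivot p;
    int        i;

    p.s = r;
    p.u = col[r];

    for (i = r + 1; i < rows; i++) {
        double mag  = (col[i] < 0.0) ? -col[i] : col[i];
        double best = (p.u    < 0.0) ? -p.u    : p.u;

        if (mag > best) {
            p.s = i;
            p.u = col[i];
        }
    }
    return p;
}
```

Note that the signed value is stored, not its magnitude, so the caller still divides by the true pivot.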


/* ===========  FUNCTION DEFINITION  ===========
 *
 *  Receive this node's columns from the root/host processor (manager),
 *  place them into the column buffer, then transfer them into A while
 *  the other processors are communicating with the root.
 *
 *  The transputer scheme is a bit more involved.  Here nodes 0000 and 1000
 *  are connected to the root and they must receive for everyone.  They (0
 *  and 8) are not directly connected to everyone, so the columns must be
 *  passed out in cycles.  For instance, suppose we used the hybrid 4-cube.
 *  Then nodes 0 and 8 would receive bursts of 8 columns at a time.  They
 *  would keep the first one (we'll call it column 0 in some sort of
 *  relative numbering scheme that abides by the C numbering convention),
 *  send the next one (col 1) in the 0x1 direction, the next to the 0x2
 *  direction, column 3 in the 0x1 direction, column 4 in the 0x4 direction,
 *  column 5 in the 0x1 direction, column 6 in the 0x2 direction, and
 *  lastly, column 7 in the 0x1 direction.  This makes cycle == 8 for nodes
 *  0000 and 1000.  Similarly, nodes x001 have a cycle of four where they
 *  keep the first column to arrive and then send the next three to
 *  directions 0x2, 0x4, and 0x2 in turn.  This distribution pattern is
 *  maintained until all of the columns have been distributed.
 */

#ifdef PROTOTYPE

void receive_columns(int dim,
                     int node,
                     Double_Matrix_Type *A,
                     int n,
                     double *cbuf,
                     int my_cols,
                     int colsize)

#else

void receive_columns(dim, node, A, n, cbuf, my_cols, colsize)

int dim,
    node;
Double_Matrix_Type *A;
int n;
double *cbuf;
int my_cols,
    colsize;

#endif
{
    int  cubesize = pow2(dim),
         cycle,                   /*  length of typical col burst    */
         dimeff   = MIN(3, dim),  /*  effective dimension            */
         from,                    /*  node that I receive from       */
         gc,                      /*  global column index            */
         i,
         idx,                     /*  index into to[]                */
         lc       = 0,            /*  local column index             */
         ldeff,                   /*  effective least_dimension()    */
         nodeff   = (node % 8),   /*  effective node number          */
         others,                  /*  no. of nodes in other 3-cube   */
         step,
         thehost  = myhost(),
         to[8];                   /*  for destination of cols rec'd  */
                                  /*  ==> direction to send to       */

#ifdef TRANSPUTER

    ldeff = least_dimension(nodeff);

    if (nodeff == 0)  from = myhost();
    else              from = node ^ pow2(ldeff - 1);

    /*  cycle describes the length of a cycle that starts with me (node)...
     *  then I receive several columns for others....then start over with
     *  me.  The nodes in the highest dimension have cycle == 1 ==> self
     *  only.  We also fill to[] with the directions that we will be
     *  sending to within a given cycle.  Not all nodes use all 8 elements
     *  of to[].  They only use the first cycle elements.  The step is the
     *  difference between the column numbers received at this node during
     *  a given burst of length cycle.
     *
     *  When we use the hybrid 4-cube, we are treating it as two 3-cubes,
     *  so the variable others is set to 8.  This is because there are 8
     *  other columns between every burst that comes to the 3-cube that
     *  node is in.
     */
    cycle = pow2(dimeff - ldeff);

    (dim == 4) ? (others = 8) : (others = 0);

    step  = pow2(ldeff);

    to[0] = 0;
    to[1] = to[3] = to[5] = to[7] = pow2(ldeff);
    to[2] = to[6] = pow2(ldeff + 1);
    to[4] = pow2(ldeff + 2);


    for (gc = node; gc < n; gc += (others + step)) {

        receive(from, (char *) cbuf, colsize, cubesize);

        for (i = 0; i < A->rows; i++)  A->matrix[i][lc] = cbuf[i];

        lc++;

        for (idx = 1; idx < cycle; idx++) {

            gc += step;

            if (gc < n) {

                receive(from, (char *) cbuf, colsize, cubesize);

                directional_send(node, dim, to[idx], (char*) cbuf, colsize);
            }
        }

    }   /*  end for(gc)  */


#else   /*  iPSC/2  */

    for (lc = 0; lc < my_cols; lc++) {

        receive(thehost, (char *) cbuf, colsize, COL_TYPE);

        for (i = 0; i < A->rows; i++) { A->matrix[i][lc] = cbuf[i]; }
    }

#endif  /*  TRANSPUTER  */

}
/*  End receive_columns()  */
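
The forwarding pattern described in the header comment of receive_columns() can be checked without any transputer hardware. The sketch below reproduces only the filling of to[] and the cycle length, taking ldeff as a parameter; the values ldeff = 0 for nodes 0000/1000 and ldeff = 1 for nodes x001 are inferred from the cycle lengths quoted in that comment, and the demo_ names are hypothetical.

```c
/* Fill to[0..7] the way receive_columns() does for a given ldeff.
 * to[0] is the node itself (keep the column); the others are link
 * directions expressed as bit masks. */
void demo_fill_directions(int ldeff, int to[8])
{
    to[0] = 0;
    to[1] = to[3] = to[5] = to[7] = 1 << ldeff;
    to[2] = to[6] = 1 << (ldeff + 1);
    to[4] = 1 << (ldeff + 2);
}

/* Number of columns in one burst: the node keeps the first and
 * forwards the remaining cycle - 1. */
int demo_cycle(int dimeff, int ldeff)
{
    return 1 << (dimeff - ldeff);
}
```

With ldeff = 0 the sequence of directions is 0x1, 0x2, 0x1, 0x4, 0x1, 0x2, 0x1, matching the burst of eight described for nodes 0 and 8; with ldeff = 1 only the first four entries are used and the directions are 0x2, 0x4, 0x2.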


/* ===========  FUNCTION DEFINITION  ===========
 *
 *  This function sends in the timing data that is held in t[].
 */

#ifdef PROTOTYPE

void submit_timing_data(int node, int dim)

#else

void submit_timing_data(node, dim)

int node,
    dim;

#endif
{
    int  dimeff = MIN(dim, 3),
         dir,
         i,
         ld     = least_dimension(node % 8),
         nodeff = (node % 8),
         root   = myhost();

    long cubesize = pow2(dim),
         tlen;


    tlen = (long) (MAX_EVENTS * sizeof(ticks));

#ifdef TRANSPUTER

    submit(node, dim, (char *) t, tlen, cubesize);

    if (dimeff == ld)  return;

    if ((nodeff == 2) || (nodeff == 3)) {

        if (dimeff > 2) {
            directional_receive(node, dim, 0x4, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }
        return;
    }

    if (nodeff == 1) {

        if (dimeff > 1) {

            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        if (dimeff > 2) {

            directional_receive(node, dim, 0x4, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        return;
    }

    if (nodeff == 0) {

        if (dimeff > 0) {

            /*  retrans from 1 or 9  */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        if (dimeff > 1) {

            /*  retrans from 2 or 10  */
            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            /*  retrans from 3 or 11  */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        if (dimeff > 2) {

            /*  retrans from 4 or 12  */
            directional_receive(node, dim, 0x4, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            /*  retrans from 5 or 13  */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            /*  retrans from 6 or 14  */
            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            /*  retrans from 7 or 15  */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }
    }

#else   /*  iPSC/2  */

    delay(1.0 + 2.0 * (float) node);

    send(root, (char *) t, tlen, (node + NODE_OFFSET));

#endif  /*  TRANSPUTER  */

}
/*  End submit_timing_data()  */


/* ===========  FUNCTION DEFINITION  ===========
 *
 *  This function performs the required operations on the Gauss Transform
 *  area, G, of A and searches for the next pivot.
 */

#ifdef PROTOTYPE

void update_G(Double_Matrix_Type *A, double *cbuf,
              int cubesize, int k, int me, int n, Pivot_Type *pivot)

#else

void update_G(A, cbuf, cubesize, k, me, n, pivot)

Double_Matrix_Type *A;
double             *cbuf;
int                 cubesize,
                    k,
                    me,
                    n;
Pivot_Type         *pivot;

#endif
{
    int   i,
          j,
          gc = 0,               /*  global column number           */
          lc = 0;               /*  local column number to start   */
    ticks start;


    while ((gc = global_column(lc, me, cubesize)) <= k)  lc++;

    /*  The pivot row is k and we know that lc is the first local column to
     *  the right of k.  Now we must move through the Gauss Transform area,
     *  all A(i,j) where i > k and j > k, and perform the operation:
     *
     *  A(i,j) = A(i,j) - A(i,k) * A(k,j)  <==>  A(i,j) -= cbuf[i]*A(k,j)
     */

    start = clock();

    for (i = k+1; i < A->rows; i++) {

        for (j = lc; j < A->cols; j++) {

            A->matrix[i][j] -= (cbuf[i] * A->matrix[k][j]);

        }   /*  end for(j)  */

    }   /*  end for(i)  */

    t[LOOPTIME] += (clock() - start);

}
/*  End update_G()  */
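
The update A(i,j) -= cbuf[i] * A(k,j) is a rank-one update of the trailing submatrix. A minimal sketch on a fixed 3x3 array follows; the array type, the DEMO_ constants and demo_update_G are assumptions for illustration, and cbuf[] is taken to hold the multipliers already divided by the pivot.

```c
#define DEMO_ROWS 3
#define DEMO_COLS 3

/* Subtract cbuf[i] times row k from row i, for every row below k and
 * every column from lc onward (lc is the first column right of the
 * pivot column). */
void demo_update_G(double a[DEMO_ROWS][DEMO_COLS], const double *cbuf,
                   int k, int lc)
{
    int i, j;

    for (i = k + 1; i < DEMO_ROWS; i++) {
        for (j = lc; j < DEMO_COLS; j++) {
            a[i][j] -= cbuf[i] * a[k][j];
        }
    }
}
```

The pivot column itself is not touched here; in the listing it already holds the multipliers after do_pivot_column_arithmetic().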

main()
{
    double  *cbuf;            /*  column buffer holds one col of A       */

    Double_Matrix_Type  *A;   /*  this node's portion of the matrix A    */

    int  cubesize,            /*  number of processors in the cube       */
         dim,                 /*  dimension of the hypercube             */
         gc,                  /*  global column number                   */
         i,                   /*  generic integer and row ctr            */
         j,                   /*  generic integer and col ctr            */
         k,                   /*  index to pivot                         */
         m,                   /*  number of rows in A (same local/all)   */
         me,                  /*  id of this processor                   */
         my_cols = 0,         /*  number of cols in local portion of A   */
         n,                   /*  number of cols in all of A             */
         root,                /*  host/root processor id                 */
         timing;              /*  Boolean                                */

    long sizeof_col,          /*  sizes, in bytes                        */
         sizeof_int,
         sizeof_pivot;

    ticks  start,
           starti;            /*  another start                          */

    Pivot_Type  pivot;


    /* ========  INITIALIZATION WORK  ======== */

    for (i = 0; i < MAX_EVENTS; i++)  t[i] = 0;

    start = t[START_TIME] = clock();


#ifdef TRANSPUTER

    cubesize = CUBESIZE;
    dim      = DIMENSION;
    initialize_hypercube(dim);

#else

    cubesize = (int) numnodes();
    dim      = (int) nodedim();

#endif

    t[DATA_SOURCE] = me = (int) mynode();
    root = (int) myhost();

    sizeof_int   = (long) sizeof(int);
    sizeof_pivot = (long) sizeof(Pivot_Type);


    /*  BROADCAST THE SIZE(A)
     *
     *  All node processors need to know the number of rows and columns in
     *  the matrix A [i.e., size(A)].  A broadcast to the entire cube,
     *  cubecast(), is used to achieve this.  The nodes also need to know
     *  whether or not to set timing on, so this value is passed too.
     */

#ifdef TRANSPUTER

    cubecast(me, dim, (char *) &m,      sizeof_int, cubesize);
    cubecast(me, dim, (char *) &n,      sizeof_int, cubesize);
    cubecast(me, dim, (char *) &timing, sizeof_int, cubesize);

#else   /*  iPSC/2  */

    cubecast(me, dim, (char *) &m,      sizeof_int, ROW_SIZE_TYPE);
    cubecast(me, dim, (char *) &n,      sizeof_int, COL_SIZE_TYPE);
    cubecast(me, dim, (char *) &timing, sizeof_int, ARG_TYPE);

#endif  /*  TRANSPUTER  */

    sizeof_col = (long) (m * sizeof(double));


    /*  COLUMN BUFFER AND COUNTER
     *
     *  The column buffer, cbuf[], will be used to hold one column of A at
     *  a time.  We will see cbuf[] used on a variety of occasions when we
     *  must work with a column of A.  Allocate cbuf[] and determine the
     *  number of columns that will be stored locally (my_cols).
     */
    cbuf = (double *) malloc(sizeof_col);

    for (i = 0; i < n; i++) { if ((i % cubesize) == me) my_cols++; }


    /*  ESTABLISH LOCAL A
     *
     *  Allocate storage space for this node's part of A (it is called A
     *  even though it is only part of A).
     */

    A = matalloc(m, my_cols);

    t[SETUP] = clock() - start;

    start = clock();

    receive_columns(dim, me, A, n, cbuf, my_cols, sizeof_col);

    t[DISTRIB_COLS] = clock() - start;



    /*  BEGIN ITERATION
     *
     *  1.)  At the top of the for() loop we have just completed update_G(),
     *       so the local candidate for the next pivot is situated in np[0].
     *       The function elect_next_pivot() performs a series of
     *       directional_exchange()s so that all local candidates compete in
     *       an election process.  The winner is np[0].
     *
     *  2.)  If all went well, np[0] contains the next pivot.  This informa-
     *
     *  3.)  If this node has the pivot column [if (p[k] == gc)], it must
     *       divide everything under the pivot by the value of the pivot and
     *       distribute the column to all other nodes (node zero sends to
     *       host).
     *
     *  4.)  Finally, this node must perform the computations across the
     *       Gauss Transform area for the local portion of A.  The
     *       update_G() function also locates the next pivot without special
     *       expense.  Then it is time to go back to the top of the loop.
     */

    start = clock();

    for (k = 0; k < (MIN(m,n)); k++) {

        pivot.id = k % cubesize;
        pivot.t  = k;

        /*  know id; k ==> t; need s, u  */

        if (pivot.id == me)  locate_pivot(me, cubesize, A, k, &pivot);

        cubecast_from(pivot.id, me, dim, (char *) &pivot, sizeof_pivot);

        if (me == 0) {

            starti = clock();

#ifdef TRANSPUTER

            send(root, (char *) &pivot, sizeof_pivot, cubesize);

#else   /*  iPSC/2  */

            send(root, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif  /*  TRANSPUTER  */

            t[PIVOTS_TO_HOST] += (clock() - starti);
        }

        swap_rows(A, k, pivot.s);

        starti = clock();

        if (pivot.id == me) {

            do_pivot_column_arithmetic(A, cbuf, k, me, cubesize);
        }

        t[PCOL_ARITHMETIC] += (clock() - starti);

        starti = clock();

        cubecast_from(pivot.id, me, dim, (char *) cbuf, sizeof_col);

        t[PCOL_DISTRIB] += (clock() - starti);


        if (me == 0) {

            starti = clock();

#ifdef TRANSPUTER

            submit(me, dim, (char *) cbuf, sizeof_col, cubesize);

#else   /*  iPSC/2  */

            submit(me, dim, (char *) cbuf, sizeof_col, PCOL_TYPE);

#endif  /*  TRANSPUTER  */

            t[PCOLS_TO_HOST] += (clock() - starti);
        }

        starti = clock();
        update_G(A, cbuf, cubesize, k, me, n, &pivot);
        t[UPDATING_G] += (clock() - starti);

    }
    /*  END ITERATION  [for(k...)]  */

    t[ITERATION] = clock() - start;


    free(cbuf);

    t[STOP] = clock();

    if (timing)  submit_timing_data(me, dim);

    return(SUCCESS);
}

753  /* =  =  =  =  = 


EOF     gfppnode.c 


*/ 
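
The loop body above interleaves communication with the arithmetic of one elimination step. As a reading aid, the following is a minimal serial sketch of that arithmetic on a single processor, assuming a small dense N-by-N array. It is not part of gfppnode.c: the names gauss_step() and owner_of_row() and the value of N are hypothetical, and owner_of_row() only restates the expression k % cubesize that the listing assigns to pivot.id.

```c
#include <assert.h>
#include <math.h>

#define N 3   /* illustrative matrix order (assumption, not from the thesis) */

/* Node that owns row k when rows are dealt out cyclically, mirroring
 * the listing's "pivot.id = k % cubesize".  Hypothetical helper. */
static int owner_of_row(int k, int cubesize)
{
    assert(k >= 0 && cubesize > 0);
    return k % cubesize;
}

/* Serial analogue of one pass through the k-loop (hypothetical).
 * Returns the row that supplied the pivot, or -1 if column k has no
 * nonzero candidate at or below the diagonal. */
static int gauss_step(double a[N][N], int k)
{
    int i, j, s = k;
    double tmp;

    assert(k >= 0 && k < N);

    /* locate the pivot: largest magnitude in column k, rows k..N-1 */
    for (i = k + 1; i < N; i++)
        if (fabs(a[i][k]) > fabs(a[s][k]))
            s = i;

    if (a[s][k] == 0.0)
        return -1;

    /* swap rows k and s */
    for (j = 0; j < N; j++) {
        tmp = a[k][j];
        a[k][j] = a[s][j];
        a[s][j] = tmp;
    }

    /* pivot-column arithmetic: divide everything under the pivot by it */
    for (i = k + 1; i < N; i++)
        a[i][k] /= a[k][k];

    /* update the remaining area with the stored multipliers */
    for (i = k + 1; i < N; i++)
        for (j = k + 1; j < N; j++)
            a[i][j] -= a[i][k] * a[k][j];

    return s;
}
```

Calling gauss_step(a, k) for k = 0, 1, ..., N-1 leaves the multipliers below the diagonal and the upper-triangular factor on and above it. In the parallel listing the same work is split up: the owner of the pivot row performs the column division, the column is broadcast, and each node updates only its local rows.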
/* ==========  PROGRAM INFORMATION  ==========
 *
 * SOURCE  : gfpcnode.c
 * VERSION : 2.3
 * DATE    : 17 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 * REMARKS : See gf.h.
 *
 */

#include <math.h>

#ifdef TRANSPUTER

#include <conc.h>

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <comm.h>
#include <generate.h>
#include <mathx.h>
#include <ops.h>
#include <timing.h>

#else

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/mathx.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"
#endif

#include "gf.h"

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#endif

ticks t[MAX_EVENTS];

/*
 * FUNCTION DEFINITION
 *
 * After this node finds its candidate for next pivot, there must be a
 * comparison with all other nodes.  The local candidate starts in np[0].
 * Direction-by-direction, candidates are exchanged and the winner is
 * positioned in np[0].  If there is a tie, the candidate from the smaller
 * node number wins.  A RANK_DEFICIENT opponent is ignored (the local
 * candidate must be at least as good).  In the end, all processors have
 * identical entries in np[0].
 */

#ifdef PROTOTYPE

void elect_next_pivot(int me, int dim, Pivot_Type *np)

#else

void elect_next_pivot(me, dim, np)

int me,
    dim;
Pivot_Type *np;

#endif
{
    int dir;

    long cubesize = pow2(dim),
         len      = sizeof(Pivot_Type);

    for (dir = 1; dir < (int) cubesize; dir <<= 1) {

        if (dir != 8) {
            directional_exchange(me, dim, dir, (char *) &(np[1]),
                                 (char *) &(np[0]), len);
        }
        else {
            if ((me % 8) != 0) { /* we don't want 0 <--> 8 comm */
                directional_exchange(me, dim, dir, (char *) &(np[1]),
                                     (char *) &(np[0]), len);
            }
        }

        if (np[1].id != RANK_DEFICIENT) {

            if (fabs(np[1].u) > fabs(np[0].u)) {
                np[0].id = np[1].id;  np[0].u = np[1].u;
                np[0].s  = np[1].s;   np[0].t = np[1].t;
            }
            else {
                if (fabs(np[1].u) == fabs(np[0].u)) {
                    if (np[1].id < np[0].id) { /* smallest breaks tie */
                        np[0].id = np[1].id;  np[0].u = np[1].u;
                        np[0].s  = np[1].s;   np[0].t = np[1].t;
                    }
                }
            }

        } /* end if (np[1].id....) */

    } /* end for(dir) */

    /* Since there is no direct connection between nodes 0 and 8, we once
     * again destroy the beauty and generality of the hypercube so that we
     * can be sure that 0 and 8 have the best candidate for pivot.
     */

    if (dim == 4) {

        if ((me % 8) == 0) { /* Nodes 0000 and 1000 */
            directional_receive(me, dim, 0x1, (char *) np, len);
        }

        if ((me % 8) == 1) { /* Nodes 0001 and 1001 */
            directional_send(me, dim, 0x1, (char *) np, len);
        }
    }
}
/* End elect_next_pivot() */


/* This is only the first part of this file.  The rest would be similar to
 * gfppnode.c
 *
 * ============  EOF  gfpcnode.c  ============ */
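
The election described in the FUNCTION DEFINITION comment can be tried out without any message passing. The sketch below simulates every node of a small hypercube in one array and applies the same three rules at each direction: ignore a RANK_DEFICIENT opponent, prefer the larger magnitude, and break ties toward the smaller node number. It is only an illustration under stated assumptions: the struct layout, the value of RANK_DEFICIENT, and the names Pivot, better() and elect() are hypothetical (the real definitions live in gf.h), and the special handling of the missing 0 <--> 8 link is left out.

```c
#include <assert.h>
#include <math.h>

#define RANK_DEFICIENT (-1)  /* assumed value; the real one is in gf.h */
#define MAX_NODES 16         /* largest simulated cube (assumption)    */

typedef struct {             /* assumed layout of Pivot_Type           */
    int    id;               /* node that owns the candidate           */
    int    s;                /* row holding the candidate              */
    int    t;                /* elimination step                       */
    double u;                /* candidate pivot value                  */
} Pivot;

/* Apply the comparison rules from the listing to one exchanged pair. */
static Pivot better(Pivot mine, Pivot theirs)
{
    if (theirs.id == RANK_DEFICIENT)
        return mine;                       /* opponent is ignored      */
    if (fabs(theirs.u) > fabs(mine.u))
        return theirs;                     /* larger magnitude wins    */
    if (fabs(theirs.u) == fabs(mine.u) && theirs.id < mine.id)
        return theirs;                     /* smallest breaks tie      */
    return mine;
}

/* Simulate the direction-by-direction exchange for 2^dim nodes.
 * cand[me] plays the role of np[0] on node me; cand[me ^ dir] is the
 * candidate that would arrive in np[1] from the neighbor across dir. */
static void elect(Pivot cand[], int dim)
{
    int nodes = 1 << dim, dir, me;
    Pivot next[MAX_NODES];

    assert(dim >= 0 && nodes <= MAX_NODES);

    for (dir = 1; dir < nodes; dir <<= 1) {
        for (me = 0; me < nodes; me++)
            next[me] = better(cand[me], cand[me ^ dir]);
        for (me = 0; me < nodes; me++)
            cand[me] = next[me];
    }
}
```

After elect(cand, dim) returns, every entry of cand is expected to hold the same winner, which corresponds to the statement that all processors end with identical entries in np[0]. Staging the results in next[] before copying back keeps each round's comparisons based on the values held before that round, as a simultaneous exchange would.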